Posted to commits@cassandra.apache.org by "Jeremy Hanna (JIRA)" <ji...@apache.org> on 2011/07/04 17:54:21 UTC

[jira] [Created] (CASSANDRA-2855) Add hadoop support option to skip rows with empty columns

Add hadoop support option to skip rows with empty columns
---------------------------------------------------------

                 Key: CASSANDRA-2855
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2855
             Project: Cassandra
          Issue Type: Improvement
          Components: Hadoop
            Reporter: Jeremy Hanna
            Assignee: Jeremy Hanna


We have been finding that range ghosts appear in results from Hadoop via Pig.  This could also happen if rows don't have data for the slice predicate that is given.  This leads to having to do a painful amount of defensive checking on the Pig side, especially in the case of range ghosts.

We would like to add an option to skip rows that have no column values in them.  That functionality used to exist in core Cassandra but was removed because of the performance penalty of the check.  However, the Hadoop support in the RecordReader is batch-oriented anyway, so individual row read performance isn't as much of an issue.  We would also make it an optional config parameter for each job, so people wouldn't have to incur that penalty if they are confident that there won't be any empty rows or if they don't care.

The parameter could be called cassandra.skip.empty.rows and take the values true/false.
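
As a rough illustration of the proposal, the sketch below shows how a per-job boolean option might gate the filtering of empty rows. It is not taken from any of the patches on this ticket. java.util.Properties stands in for the Hadoop job configuration, the class and method names (SkipEmptyRowsSketch, visibleRowKeys) are hypothetical, and the default of false is an assumption.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;
import java.util.Properties;

public class SkipEmptyRowsSketch {
    // Property name taken from the proposal; the "false" default below is an assumption.
    static final String SKIP_EMPTY_ROWS = "cassandra.skip.empty.rows";

    /** Returns the row keys a job would see, dropping rows with no columns when the option is on. */
    public static List<String> visibleRowKeys(Properties jobConf, Map<String, List<String>> rows) {
        boolean skipEmpty = Boolean.parseBoolean(jobConf.getProperty(SKIP_EMPTY_ROWS, "false"));
        List<String> keys = new ArrayList<>();
        for (Map.Entry<String, List<String>> row : rows.entrySet()) {
            if (skipEmpty && row.getValue().isEmpty()) {
                continue; // a range ghost, or a row with no data for the predicate
            }
            keys.add(row.getKey());
        }
        return keys;
    }

    public static void main(String[] args) {
        // Hypothetical data: one row with columns, one row that came back empty.
        Map<String, List<String>> rows = new LinkedHashMap<>();
        rows.put("live", Arrays.asList("col1", "col2"));
        rows.put("ghost", Collections.<String>emptyList());

        Properties conf = new Properties();
        System.out.println(visibleRowKeys(conf, rows)); // option unset: both rows are returned
        conf.setProperty(SKIP_EMPTY_ROWS, "true");
        System.out.println(visibleRowKeys(conf, rows)); // option on: the empty row is dropped
    }
}
```

Leaving the option off by default would mean that existing jobs keep their current behavior and only jobs that opt in pay for the extra check.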

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (CASSANDRA-2855) Skip rows with empty columns when slicing entire row

Posted by "Jeremy Hanna (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13089045#comment-13089045 ] 

Jeremy Hanna commented on CASSANDRA-2855:
-----------------------------------------

True - wouldn't matter.

> Skip rows with empty columns when slicing entire row
> ----------------------------------------------------
>
>                 Key: CASSANDRA-2855
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2855
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: API
>            Reporter: Jeremy Hanna
>            Assignee: Jeremy Hanna
>            Priority: Minor
>              Labels: hadoop
>             Fix For: 0.8.5
>
>         Attachments: 2855-v2.txt, 2855-v3.txt, 2855-v4.txt
>
>


[jira] [Assigned] (CASSANDRA-2855) Skip rows with empty columns when slicing entire row

Posted by "T Jake Luciani (Assigned) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

T Jake Luciani reassigned CASSANDRA-2855:
-----------------------------------------

    Assignee: T Jake Luciani  (was: Jeremy Hanna)
    
> Skip rows with empty columns when slicing entire row
> ----------------------------------------------------
>
>                 Key: CASSANDRA-2855
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2855
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: API
>            Reporter: Jeremy Hanna
>            Assignee: T Jake Luciani
>            Priority: Minor
>              Labels: hadoop
>             Fix For: 0.8.8
>
>         Attachments: 2855-v2.txt, 2855-v3.txt, 2855-v4.txt, 2855-v5.txt, v1-0001-CASSANDRA-2855-ignore-ghosts-when-no-predicate-specifi.txt
>
>

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (CASSANDRA-2855) Skip rows with empty columns when slicing entire row

Posted by "Brandon Williams (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brandon Williams updated CASSANDRA-2855:
----------------------------------------

    Reviewer: brandon.williams

> Skip rows with empty columns when slicing entire row
> ----------------------------------------------------
>
>                 Key: CASSANDRA-2855
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2855
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: API
>            Reporter: Jeremy Hanna
>            Assignee: Jeremy Hanna
>            Priority: Minor
>              Labels: hadoop
>             Fix For: 0.8.5
>
>         Attachments: 2855-v2.txt, 2855-v3.txt, 2855-v4.txt
>
>


[jira] [Updated] (CASSANDRA-2855) Skip rows with empty columns when slicing entire row

Posted by "T Jake Luciani (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

T Jake Luciani updated CASSANDRA-2855:
--------------------------------------

    Attachment: v1-0001-CASSANDRA-2855-ignore-ghosts-when-no-predicate-specifi.txt
    
> Skip rows with empty columns when slicing entire row
> ----------------------------------------------------
>
>                 Key: CASSANDRA-2855
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2855
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: API
>            Reporter: Jeremy Hanna
>            Assignee: Jeremy Hanna
>            Priority: Minor
>              Labels: hadoop
>             Fix For: 0.8.8
>
>         Attachments: 2855-v2.txt, 2855-v3.txt, 2855-v4.txt, 2855-v5.txt, v1-0001-CASSANDRA-2855-ignore-ghosts-when-no-predicate-specifi.txt
>
>


[jira] [Commented] (CASSANDRA-2855) Skip rows with empty columns when slicing entire row

Posted by "Jeremy Hanna (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13122016#comment-13122016 ] 

Jeremy Hanna commented on CASSANDRA-2855:
-----------------------------------------

+1
                
> Skip rows with empty columns when slicing entire row
> ----------------------------------------------------
>
>                 Key: CASSANDRA-2855
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2855
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: API
>            Reporter: Jeremy Hanna
>            Assignee: Jeremy Hanna
>            Priority: Minor
>              Labels: hadoop
>             Fix For: 0.8.8
>
>         Attachments: 2855-v2.txt, 2855-v3.txt, 2855-v4.txt, 2855-v5.txt
>
>


[jira] [Assigned] (CASSANDRA-2855) Skip rows with empty columns when slicing entire row

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis reassigned CASSANDRA-2855:
-----------------------------------------

    Assignee:     (was: Jonathan Ellis)

> Skip rows with empty columns when slicing entire row
> ----------------------------------------------------
>
>                 Key: CASSANDRA-2855
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2855
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: API
>            Reporter: Jeremy Hanna
>            Priority: Minor
>              Labels: hadoop
>             Fix For: 0.8.2
>
>


[jira] [Commented] (CASSANDRA-2855) Add hadoop support option to skip rows with empty columns

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13059972#comment-13059972 ] 

Jonathan Ellis commented on CASSANDRA-2855:
-------------------------------------------

I don't like the idea of adding flags to change behavior.

What I think we *could* do is not bother including empty rows in the resultset, IF we are doing a slice query for the entire row.  (Since, as soon as the tombstones expire, they will be gone anyway.)

> Add hadoop support option to skip rows with empty columns
> ---------------------------------------------------------
>
>                 Key: CASSANDRA-2855
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2855
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Hadoop
>            Reporter: Jeremy Hanna
>            Assignee: Jeremy Hanna
>              Labels: hadoop
>


[jira] [Commented] (CASSANDRA-2855) Skip rows with empty columns when slicing entire row

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13060002#comment-13060002 ] 

Jonathan Ellis commented on CASSANDRA-2855:
-------------------------------------------

bq. is it more expensive/complicated to do it for an empty slice

An empty result for an entire-row slice means the row really will be gone when the tombstone expires, so the two are semantically equivalent.  This is not the case for a smaller slice; an empty result for that could mean "there is data in the row, just not in the slice you requested," so leaving that row out would be an error.
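
The distinction can be sketched in Java as follows. This is only an illustration and not the code from any attached patch. The names (RowSkipSketch, isFullRowSlice, shouldSkip) are hypothetical, and the sketch assumes that an empty start and an empty finish together mean the slice covers the whole row.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class RowSkipSketch {
    /** Assumption: an empty start and an empty finish together denote a slice over the entire row. */
    public static boolean isFullRowSlice(byte[] start, byte[] finish) {
        return start.length == 0 && finish.length == 0;
    }

    /**
     * Skip a row only when the slice covered the whole row and nothing came back.
     * An empty result for a narrower slice is kept, because the row may still hold
     * data outside the requested range.
     */
    public static boolean shouldSkip(byte[] start, byte[] finish, List<String> columns) {
        return isFullRowSlice(start, finish) && columns.isEmpty();
    }

    public static void main(String[] args) {
        byte[] unbounded = new byte[0];
        byte[] fromA = {'a'};
        List<String> noColumns = Collections.<String>emptyList();

        // Full-row slice with an empty result: safe to leave out.
        System.out.println(shouldSkip(unbounded, unbounded, noColumns));
        // Partial slice with an empty result: must be kept.
        System.out.println(shouldSkip(fromA, unbounded, noColumns));
        // Full-row slice that returned data: must be kept.
        System.out.println(shouldSkip(unbounded, unbounded, Arrays.asList("col1")));
    }
}
```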

> Skip rows with empty columns when slicing entire row
> ----------------------------------------------------
>
>                 Key: CASSANDRA-2855
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2855
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: API
>            Reporter: Jeremy Hanna
>            Priority: Minor
>              Labels: hadoop
>             Fix For: 0.8.2
>
>


[jira] [Commented] (CASSANDRA-2855) Skip rows with empty columns when slicing entire row

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13146376#comment-13146376 ] 

Hudson commented on CASSANDRA-2855:
-----------------------------------

Integrated in Cassandra-0.8 #395 (See [https://builds.apache.org/job/Cassandra-0.8/395/])
    Revert CASSANDRA-2855

jake : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1199245
Files : 
* /cassandra/branches/cassandra-0.8/CHANGES.txt
* /cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/hadoop/ColumnFamilyRecordReader.java

                
> Skip rows with empty columns when slicing entire row
> ----------------------------------------------------
>
>                 Key: CASSANDRA-2855
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2855
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: API
>            Reporter: Jeremy Hanna
>            Assignee: Jeremy Hanna
>            Priority: Minor
>              Labels: hadoop
>             Fix For: 0.8.8
>
>         Attachments: 2855-v2.txt, 2855-v3.txt, 2855-v4.txt, 2855-v5.txt
>
>


[jira] [Updated] (CASSANDRA-2855) Skip rows with empty columns when slicing entire row

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-2855:
--------------------------------------

      Component/s:     (was: Hadoop)
                   API
         Priority: Minor  (was: Major)
    Fix Version/s: 0.8.2
         Assignee: Jonathan Ellis  (was: Jeremy Hanna)
          Summary: Skip rows with empty columns when slicing entire row  (was: Add hadoop support option to skip rows with empty columns)

> Skip rows with empty columns when slicing entire row
> ----------------------------------------------------
>
>                 Key: CASSANDRA-2855
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2855
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: API
>            Reporter: Jeremy Hanna
>            Assignee: Jonathan Ellis
>            Priority: Minor
>              Labels: hadoop
>             Fix For: 0.8.2
>
>


[jira] [Commented] (CASSANDRA-2855) Skip rows with empty columns when slicing entire row

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13148007#comment-13148007 ] 

Hudson commented on CASSANDRA-2855:
-----------------------------------

Integrated in Cassandra-0.8 #398 (See [https://builds.apache.org/job/Cassandra-0.8/398/])
    Skip empty rows when entire row is requested, redux.
Patch by tjake, reviewed by brandonwilliams for CASSANDRA-2855

brandonwilliams : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1200471
Files : 
* /cassandra/branches/cassandra-0.8/CHANGES.txt
* /cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/hadoop/ColumnFamilyRecordReader.java

                
> Skip rows with empty columns when slicing entire row
> ----------------------------------------------------
>
>                 Key: CASSANDRA-2855
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2855
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: API
>            Reporter: Jeremy Hanna
>            Assignee: T Jake Luciani
>            Priority: Minor
>              Labels: hadoop
>             Fix For: 0.8.8
>
>         Attachments: 2855-v2.txt, 2855-v3.txt, 2855-v4.txt, 2855-v5.txt, v1-0001-CASSANDRA-2855-ignore-ghosts-when-no-predicate-specifi.txt
>
>


[jira] [Updated] (CASSANDRA-2855) Skip rows with empty columns when slicing entire row

Posted by "Jeremy Hanna (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeremy Hanna updated CASSANDRA-2855:
------------------------------------

    Attachment: 2855-v3.txt

> Skip rows with empty columns when slicing entire row
> ----------------------------------------------------
>
>                 Key: CASSANDRA-2855
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2855
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: API
>            Reporter: Jeremy Hanna
>            Assignee: Jeremy Hanna
>            Priority: Minor
>              Labels: hadoop
>             Fix For: 0.7.9, 0.8.3
>
>         Attachments: 2855-v2.txt, 2855-v3.txt
>
>


[jira] [Assigned] (CASSANDRA-2855) Skip rows with empty columns when slicing entire row

Posted by "Jeremy Hanna (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeremy Hanna reassigned CASSANDRA-2855:
---------------------------------------

    Assignee: Jeremy Hanna

> Skip rows with empty columns when slicing entire row
> ----------------------------------------------------
>
>                 Key: CASSANDRA-2855
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2855
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: API
>            Reporter: Jeremy Hanna
>            Assignee: Jeremy Hanna
>            Priority: Minor
>              Labels: hadoop
>             Fix For: 0.8.2
>
>


[jira] [Issue Comment Edited] (CASSANDRA-2855) Skip rows with empty columns when slicing entire row

Posted by "Jeremy Hanna (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13072557#comment-13072557 ] 

Jeremy Hanna edited comment on CASSANDRA-2855 at 7/28/11 10:16 PM:
-------------------------------------------------------------------

Added a configuration property cassandra.skip.empty.results, which defaults to false.  We can't skip only completely empty rows, because there is no way to tell whether the whole row is empty from the result of a slice predicate.

      was (Author: jeromatron):
    Added a configuration property cassandra.skip.empty.results which defaults to false.  We can't skip just complete empty rows because there is no way to tell if the complete row is empty based on a result that is a slice predicate.
  
> Skip rows with empty columns when slicing entire row
> ----------------------------------------------------
>
>                 Key: CASSANDRA-2855
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2855
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: API
>            Reporter: Jeremy Hanna
>            Assignee: Jeremy Hanna
>            Priority: Minor
>              Labels: hadoop
>             Fix For: 0.7.9, 0.8.3
>
>         Attachments: 2855-v2.txt, 2855-v3.txt
>
>

--
This message is automatically generated by JIRA.
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (CASSANDRA-2855) Skip rows with empty columns when slicing entire row

Posted by "Jeremy Hanna (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13148832#comment-13148832 ] 

Jeremy Hanna commented on CASSANDRA-2855:
-----------------------------------------

FWIW, I saw an interesting analogous ticket for HBase storage: https://issues.apache.org/jira/browse/PIG-2114
It talks about omitNulls and how it's used on both the load and the store side.
                
> Skip rows with empty columns when slicing entire row
> ----------------------------------------------------
>
>                 Key: CASSANDRA-2855
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2855
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: API
>            Reporter: Jeremy Hanna
>            Assignee: T Jake Luciani
>            Priority: Minor
>              Labels: hadoop
>             Fix For: 0.8.8
>
>         Attachments: 2855-v2.txt, 2855-v3.txt, 2855-v4.txt, 2855-v5.txt, v1-0001-CASSANDRA-2855-ignore-ghosts-when-no-predicate-specifi.txt
>
>

[jira] [Commented] (CASSANDRA-2855) Skip rows with empty columns when slicing entire row

Posted by "T Jake Luciani (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13146313#comment-13146313 ] 

T Jake Luciani commented on CASSANDRA-2855:
-------------------------------------------

Reverted; will submit a new patch.
                
> Skip rows with empty columns when slicing entire row
> ----------------------------------------------------
>
>                 Key: CASSANDRA-2855
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2855
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: API
>            Reporter: Jeremy Hanna
>            Assignee: Jeremy Hanna
>            Priority: Minor
>              Labels: hadoop
>             Fix For: 0.8.8
>
>         Attachments: 2855-v2.txt, 2855-v3.txt, 2855-v4.txt, 2855-v5.txt
>
>

[jira] [Commented] (CASSANDRA-2855) Skip rows with empty columns when slicing entire row

Posted by "Jeremy Hanna (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13070764#comment-13070764 ] 

Jeremy Hanna commented on CASSANDRA-2855:
-----------------------------------------

Brandon said that the empty-slice comment referred only to core Cassandra, so in the CFRR (ColumnFamilyRecordReader) I just skipped any key that didn't have values, hoping that isSetColumns handles all cases for that.
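
The skip described above can be illustrated with a minimal sketch in plain Java. The `Row` class and the `skipEmpty` helper are hypothetical stand-ins, not the actual ColumnFamilyRecordReader code or the Thrift-generated types; the null check only loosely mirrors what an isSetColumns-style test would guard against.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class SkipEmptyRows {
    /** Hypothetical stand-in for one row returned by a range slice. */
    static final class Row {
        final String key;
        final List<String> columns; // null or empty for a range ghost

        Row(String key, List<String> columns) {
            this.key = key;
            this.columns = columns;
        }
    }

    /** Returns only the rows that carry at least one column. */
    static List<Row> skipEmpty(List<Row> rows) {
        List<Row> kept = new ArrayList<>();
        for (Row row : rows) {
            if (row.columns != null && !row.columns.isEmpty()) {
                kept.add(row);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
            new Row("live", Arrays.asList("name", "email")),
            new Row("ghost", Collections.<String>emptyList()),
            new Row("unset", null));
        for (Row row : skipEmpty(rows)) {
            System.out.println(row.key); // prints "live"
        }
    }
}
```

Filtering in the reader keeps the defensive checks out of every Pig script, at the cost of one extra test per row.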

> Skip rows with empty columns when slicing entire row
> ----------------------------------------------------
>
>                 Key: CASSANDRA-2855
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2855
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: API
>            Reporter: Jeremy Hanna
>            Assignee: Jeremy Hanna
>            Priority: Minor
>              Labels: hadoop
>             Fix For: 0.7.9, 0.8.3
>
>         Attachments: 2855.txt
>
>

[jira] [Updated] (CASSANDRA-2855) Skip rows with empty columns when slicing entire row

Posted by "Brandon Williams (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brandon Williams updated CASSANDRA-2855:
----------------------------------------

    Attachment: 2855-v5.txt

v5 removes the skip.empty.rows option, making this behavior always-on.

> Skip rows with empty columns when slicing entire row
> ----------------------------------------------------
>
>                 Key: CASSANDRA-2855
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2855
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: API
>            Reporter: Jeremy Hanna
>            Assignee: Jeremy Hanna
>            Priority: Minor
>              Labels: hadoop
>             Fix For: 0.8.5
>
>         Attachments: 2855-v2.txt, 2855-v3.txt, 2855-v4.txt, 2855-v5.txt
>
>

[jira] [Commented] (CASSANDRA-2855) Add hadoop support option to skip rows with empty columns

Posted by "Jeremy Hanna (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13059978#comment-13059978 ] 

Jeremy Hanna commented on CASSANDRA-2855:
-----------------------------------------

{quote}
What I think we could do is not bother including empty rows in the resultset, IF we are doing a slice query for the entire row. (Since, as soon as the tombstones expire, they will be gone anyway.)
{quote}
Yeah - our primary concern is tombstones.  Would be great to get that done at a lower level.

> Add hadoop support option to skip rows with empty columns
> ---------------------------------------------------------
>
>                 Key: CASSANDRA-2855
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2855
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: Hadoop
>            Reporter: Jeremy Hanna
>            Assignee: Jeremy Hanna
>              Labels: hadoop
>

[jira] [Commented] (CASSANDRA-2855) Skip rows with empty columns when slicing entire row

Posted by "Brandon Williams (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13147879#comment-13147879 ] 

Brandon Williams commented on CASSANDRA-2855:
---------------------------------------------

+1
                
> Skip rows with empty columns when slicing entire row
> ----------------------------------------------------
>
>                 Key: CASSANDRA-2855
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2855
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: API
>            Reporter: Jeremy Hanna
>            Assignee: T Jake Luciani
>            Priority: Minor
>              Labels: hadoop
>             Fix For: 0.8.8
>
>         Attachments: 2855-v2.txt, 2855-v3.txt, 2855-v4.txt, 2855-v5.txt, v1-0001-CASSANDRA-2855-ignore-ghosts-when-no-predicate-specifi.txt
>
>

[jira] [Commented] (CASSANDRA-2855) Skip rows with empty columns when slicing entire row

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13125980#comment-13125980 ] 

Hudson commented on CASSANDRA-2855:
-----------------------------------

Integrated in Cassandra-0.8 #368 (See [https://builds.apache.org/job/Cassandra-0.8/368/])
    Skip empty rows when slicing the entire row.
Patch by Jeremy Hanna and brandonwilliams, reviewed by Jeremy Hanna for
CASSANDRA-2855

brandonwilliams : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1182463
Files : 
* /cassandra/branches/cassandra-0.8/CHANGES.txt
* /cassandra/branches/cassandra-0.8/src/java/org/apache/cassandra/hadoop/ColumnFamilyRecordReader.java

                
> Skip rows with empty columns when slicing entire row
> ----------------------------------------------------
>
>                 Key: CASSANDRA-2855
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2855
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: API
>            Reporter: Jeremy Hanna
>            Assignee: Jeremy Hanna
>            Priority: Minor
>              Labels: hadoop
>             Fix For: 0.8.8
>
>         Attachments: 2855-v2.txt, 2855-v3.txt, 2855-v4.txt, 2855-v5.txt
>
>

[jira] [Resolved] (CASSANDRA-2855) Skip rows with empty columns when slicing entire row

Posted by "Brandon Williams (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brandon Williams resolved CASSANDRA-2855.
-----------------------------------------

    Resolution: Fixed

Committed.
                
> Skip rows with empty columns when slicing entire row
> ----------------------------------------------------
>
>                 Key: CASSANDRA-2855
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2855
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: API
>            Reporter: Jeremy Hanna
>            Assignee: T Jake Luciani
>            Priority: Minor
>              Labels: hadoop
>             Fix For: 0.8.8
>
>         Attachments: 2855-v2.txt, 2855-v3.txt, 2855-v4.txt, 2855-v5.txt, v1-0001-CASSANDRA-2855-ignore-ghosts-when-no-predicate-specifi.txt
>
>

[jira] [Updated] (CASSANDRA-2855) Skip rows with empty columns when slicing entire row

Posted by "Jeremy Hanna (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeremy Hanna updated CASSANDRA-2855:
------------------------------------

    Attachment: 2855.txt

Simple patch to skip results that have no values for the key.

> Skip rows with empty columns when slicing entire row
> ----------------------------------------------------
>
>                 Key: CASSANDRA-2855
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2855
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: API
>            Reporter: Jeremy Hanna
>            Assignee: Jeremy Hanna
>            Priority: Minor
>              Labels: hadoop
>             Fix For: 0.7.9, 0.8.3
>
>         Attachments: 2855.txt
>
>

[jira] [Commented] (CASSANDRA-2855) Skip rows with empty columns when slicing entire row

Posted by "Brandon Williams (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13078912#comment-13078912 ] 

Brandon Williams commented on CASSANDRA-2855:
---------------------------------------------

skip.empty.results should probably be 'skip.empty.rows' or 'skip.tombstones'.  There also needs to be a check on the predicate to see whether it covers the entire row: if so, suppress the tombstone; if not, return the empty slice.
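
A rough sketch of that predicate check, in plain Java. Everything here is hypothetical: the `Predicate` and `Row` classes are simplified stand-ins rather than the Thrift types, and the sketch assumes that "covers the entire row" means a range with an empty start and an empty finish and no explicit column names.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

public class GhostFilter {
    /** Hypothetical, simplified predicate: named columns, or a start/finish range. */
    static final class Predicate {
        final List<String> columnNames; // null when a range is used instead
        final String start;             // "" means unbounded
        final String finish;            // "" means unbounded

        Predicate(List<String> columnNames, String start, String finish) {
            this.columnNames = columnNames;
            this.start = start;
            this.finish = finish;
        }

        /** True only for an unbounded range, i.e. a slice over the whole row. */
        boolean coversEntireRow() {
            return columnNames == null && start.isEmpty() && finish.isEmpty();
        }
    }

    /** Hypothetical stand-in for one returned row. */
    static final class Row {
        final String key;
        final List<String> columns;

        Row(String key, List<String> columns) {
            this.key = key;
            this.columns = columns;
        }
    }

    /** Drops empty rows only when the predicate spans the whole row. */
    static List<Row> filter(List<Row> rows, Predicate predicate) {
        if (!predicate.coversEntireRow()) {
            return rows; // an empty slice is a legitimate result; keep it
        }
        List<Row> kept = new ArrayList<>();
        for (Row row : rows) {
            if (!row.columns.isEmpty()) {
                kept.add(row);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        List<Row> rows = Arrays.asList(
            new Row("live", Arrays.asList("name")),
            new Row("ghost", Collections.<String>emptyList()));
        Predicate wholeRow = new Predicate(null, "", "");
        Predicate partial = new Predicate(null, "a", "m");
        System.out.println(filter(rows, wholeRow).size()); // 1
        System.out.println(filter(rows, partial).size());  // 2
    }
}
```

The distinction matters because an empty result for a partial slice only says that the row has no columns in that range, not that the row is gone.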

> Skip rows with empty columns when slicing entire row
> ----------------------------------------------------
>
>                 Key: CASSANDRA-2855
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2855
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: API
>            Reporter: Jeremy Hanna
>            Assignee: Jeremy Hanna
>            Priority: Minor
>              Labels: hadoop
>             Fix For: 0.8.4
>
>         Attachments: 2855-v2.txt, 2855-v3.txt
>
>

[jira] [Updated] (CASSANDRA-2855) Skip rows with empty columns when slicing entire row

Posted by "Jeremy Hanna (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeremy Hanna updated CASSANDRA-2855:
------------------------------------

    Attachment:     (was: 2855-v3.txt)

> Skip rows with empty columns when slicing entire row
> ----------------------------------------------------
>
>                 Key: CASSANDRA-2855
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2855
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: API
>            Reporter: Jeremy Hanna
>            Assignee: Jeremy Hanna
>            Priority: Minor
>              Labels: hadoop
>             Fix For: 0.7.9, 0.8.3
>
>         Attachments: 2855-v2.txt, 2855-v3.txt
>
>

[jira] [Updated] (CASSANDRA-2855) Skip rows with empty columns when slicing entire row

Posted by "T Jake Luciani (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

T Jake Luciani updated CASSANDRA-2855:
--------------------------------------


next attempt
                
> Skip rows with empty columns when slicing entire row
> ----------------------------------------------------
>
>                 Key: CASSANDRA-2855
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2855
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: API
>            Reporter: Jeremy Hanna
>            Assignee: T Jake Luciani
>            Priority: Minor
>              Labels: hadoop
>             Fix For: 0.8.8
>
>         Attachments: 2855-v2.txt, 2855-v3.txt, 2855-v4.txt, 2855-v5.txt, v1-0001-CASSANDRA-2855-ignore-ghosts-when-no-predicate-specifi.txt
>
>

[jira] [Updated] (CASSANDRA-2855) Skip rows with empty columns when slicing entire row

Posted by "Jeremy Hanna (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeremy Hanna updated CASSANDRA-2855:
------------------------------------

    Attachment: 2855-v4.txt

v4 renames the config variable to cassandra.skip.empty.rows and skips empty rows only if the slice predicate is empty.

> Skip rows with empty columns when slicing entire row
> ----------------------------------------------------
>
>                 Key: CASSANDRA-2855
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2855
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: API
>            Reporter: Jeremy Hanna
>            Assignee: Jeremy Hanna
>            Priority: Minor
>              Labels: hadoop
>             Fix For: 0.8.4
>
>         Attachments: 2855-v2.txt, 2855-v3.txt, 2855-v4.txt
>
>

[jira] [Commented] (CASSANDRA-2855) Skip rows with empty columns when slicing entire row

Posted by "Jeremy Hanna (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13059995#comment-13059995 ] 

Jeremy Hanna commented on CASSANDRA-2855:
-----------------------------------------

Is it more expensive or complicated to do it for an empty slice, or is that just orthogonal to this since it is handled in a different place?

> Skip rows with empty columns when slicing entire row
> ----------------------------------------------------
>
>                 Key: CASSANDRA-2855
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2855
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: API
>            Reporter: Jeremy Hanna
>            Assignee: Jonathan Ellis
>            Priority: Minor
>              Labels: hadoop
>             Fix For: 0.8.2
>
>

[jira] [Updated] (CASSANDRA-2855) Skip rows with empty columns when slicing entire row

Posted by "Jeremy Hanna (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeremy Hanna updated CASSANDRA-2855:
------------------------------------

    Attachment:     (was: 2855.txt)

> Skip rows with empty columns when slicing entire row
> ----------------------------------------------------
>
>                 Key: CASSANDRA-2855
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2855
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: API
>            Reporter: Jeremy Hanna
>            Assignee: Jeremy Hanna
>            Priority: Minor
>              Labels: hadoop
>             Fix For: 0.7.9, 0.8.3


[jira] [Resolved] (CASSANDRA-2855) Skip rows with empty columns when slicing entire row

Posted by "Brandon Williams (Resolved) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brandon Williams resolved CASSANDRA-2855.
-----------------------------------------

    Resolution: Fixed

Committed
                
> Skip rows with empty columns when slicing entire row
> ----------------------------------------------------
>
>                 Key: CASSANDRA-2855
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2855
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: API
>            Reporter: Jeremy Hanna
>            Assignee: Jeremy Hanna
>            Priority: Minor
>              Labels: hadoop
>             Fix For: 0.8.8
>
>         Attachments: 2855-v2.txt, 2855-v3.txt, 2855-v4.txt, 2855-v5.txt


[jira] [Updated] (CASSANDRA-2855) Skip rows with empty columns when slicing entire row

Posted by "Jeremy Hanna (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeremy Hanna updated CASSANDRA-2855:
------------------------------------

    Attachment: 2855-v2.txt

v2 has been tested and skips results that contain no columns or only tombstones.  It also fixes an exception that occurred because lastRow looked at the altered (filtered) set of rows.
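
For readers following along, here is a minimal sketch of the idea described above, not the code in 2855-v2.txt. It assumes a range ghost shows up as a row key with an empty column list, and every name in it (EmptyRowFilterSketch, Row, Batch, filter) is hypothetical. The point it illustrates is that the paging key has to come from the batch as fetched, before empty rows are dropped.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical sketch; none of these names come from the attached patch.
final class EmptyRowFilterSketch {

    /** Stand-in for one row of a range slice result: a key plus its live columns. */
    static final class Row {
        final String key;
        final List<String> columns;

        Row(String key, List<String> columns) {
            this.key = key;
            this.columns = columns;
        }
    }

    /** Rows to hand to the job, plus the key the next page should start from. */
    static final class Batch {
        final List<Row> rows;
        final String lastKey;

        Batch(List<Row> rows, String lastKey) {
            this.rows = rows;
            this.lastKey = lastKey;
        }
    }

    static Batch filter(List<Row> fetched) {
        // Take the paging key from the unfiltered batch. Reading it from the
        // filtered list would fail (or page incorrectly) when trailing rows are dropped.
        String lastKey = fetched.isEmpty() ? null : fetched.get(fetched.size() - 1).key;

        List<Row> kept = new ArrayList<>();
        for (Row row : fetched) {
            if (!row.columns.isEmpty()) {
                kept.add(row);
            }
        }
        return new Batch(kept, lastKey);
    }
}
```

With a fetched batch of row "a" (one column) followed by row "b" (no columns), filter keeps only "a" but still reports "b" as the last key, so the next page resumes after the ghost instead of re-reading it.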

> Skip rows with empty columns when slicing entire row
> ----------------------------------------------------
>
>                 Key: CASSANDRA-2855
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2855
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: API
>            Reporter: Jeremy Hanna
>            Assignee: Jeremy Hanna
>            Priority: Minor
>              Labels: hadoop
>             Fix For: 0.7.9, 0.8.3
>
>         Attachments: 2855-v2.txt


[jira] [Updated] (CASSANDRA-2855) Skip rows with empty columns when slicing entire row

Posted by "Jeremy Hanna (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeremy Hanna updated CASSANDRA-2855:
------------------------------------

    Attachment: 2855-v3.txt

Added a configuration property, cassandra.skip.empty.results, which defaults to false.  We can't skip only completely empty rows, because a result produced by a slice predicate gives no way to tell whether the complete row is empty.
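
As a rough illustration of the opt-in default, the sketch below reads the property with java.util.Properties standing in for the Hadoop job configuration. The class and method names are hypothetical and this is not how the patch itself is written. In a real job the equivalent lookup would presumably go through Hadoop's Configuration.getBoolean(name, defaultValue).

```java
import java.util.Properties;

// Hypothetical sketch; java.util.Properties stands in for the job configuration.
final class SkipEmptyConfigSketch {

    // Property name taken from the comment above.
    static final String SKIP_EMPTY_RESULTS = "cassandra.skip.empty.results";

    /** Returns true only when the job explicitly opts in; an unset property means false. */
    static boolean skipEmptyResults(Properties jobProps) {
        return Boolean.parseBoolean(jobProps.getProperty(SKIP_EMPTY_RESULTS, "false"));
    }
}
```

Jobs that never set the property keep the existing behaviour and pay no extra filtering cost.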

> Skip rows with empty columns when slicing entire row
> ----------------------------------------------------
>
>                 Key: CASSANDRA-2855
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2855
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: API
>            Reporter: Jeremy Hanna
>            Assignee: Jeremy Hanna
>            Priority: Minor
>              Labels: hadoop
>             Fix For: 0.7.9, 0.8.3
>
>         Attachments: 2855-v2.txt, 2855-v3.txt


[jira] [Reopened] (CASSANDRA-2855) Skip rows with empty columns when slicing entire row

Posted by "T Jake Luciani (Reopened) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

T Jake Luciani reopened CASSANDRA-2855:
---------------------------------------


marking re-opened
                
> Skip rows with empty columns when slicing entire row
> ----------------------------------------------------
>
>                 Key: CASSANDRA-2855
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2855
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: API
>            Reporter: Jeremy Hanna
>            Assignee: T Jake Luciani
>            Priority: Minor
>              Labels: hadoop
>             Fix For: 0.8.8
>
>         Attachments: 2855-v2.txt, 2855-v3.txt, 2855-v4.txt, 2855-v5.txt, v1-0001-CASSANDRA-2855-ignore-ghosts-when-no-predicate-specifi.txt


[jira] [Commented] (CASSANDRA-2855) Skip rows with empty columns when slicing entire row

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13089040#comment-13089040 ] 

Jonathan Ellis commented on CASSANDRA-2855:
-------------------------------------------

If we are only skipping when the predicate covers the entire row (which is the Right Thing imo), why do we need the configuration setting?  Can't we just make it always skip?  Look at it this way: you're giving the same result that the user would see anyway if he had a lower tombstone grace.
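
A minimal sketch of that rule, under stated assumptions: "covers the entire row" is taken to mean a slice range with empty start and finish, a positive count, and no explicit column names. The Predicate class below is a hypothetical stand-in, not the Thrift SlicePredicate, and the method names are made up for illustration.

```java
import java.util.List;

// Hypothetical sketch; Predicate is a stand-in, not the Thrift SlicePredicate.
final class FullRowPredicateSketch {

    static final class Predicate {
        final List<String> columnNames; // null when a slice range is used instead
        final byte[] start;             // empty means "from the first column"
        final byte[] finish;            // empty means "to the last column"
        final int count;                // maximum number of columns requested

        Predicate(List<String> columnNames, byte[] start, byte[] finish, int count) {
            this.columnNames = columnNames;
            this.start = start;
            this.finish = finish;
            this.count = count;
        }
    }

    /** Assumed definition: an unbounded slice range asking for at least one column. */
    static boolean coversEntireRow(Predicate p) {
        return p.columnNames == null
                && p.start.length == 0
                && p.finish.length == 0
                && p.count > 0;
    }

    /** Skip only when no columns came back AND the predicate could have seen any column. */
    static boolean shouldSkip(Predicate p, int returnedColumns) {
        return returnedColumns == 0 && coversEntireRow(p);
    }
}
```

Under this rule a row that returns nothing for a bounded slice or a named-column predicate is still delivered, because the row may hold live columns outside the requested range.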

> Skip rows with empty columns when slicing entire row
> ----------------------------------------------------
>
>                 Key: CASSANDRA-2855
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2855
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: API
>            Reporter: Jeremy Hanna
>            Assignee: Jeremy Hanna
>            Priority: Minor
>              Labels: hadoop
>             Fix For: 0.8.5
>
>         Attachments: 2855-v2.txt, 2855-v3.txt, 2855-v4.txt


[jira] [Commented] (CASSANDRA-2855) Skip rows with empty columns when slicing entire row

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2855?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13059996#comment-13059996 ] 

Jonathan Ellis commented on CASSANDRA-2855:
-------------------------------------------

Sylvain points out that doing this at the Thrift layer would break the row count contract.  We could still do this at the CFRR (ColumnFamilyRecordReader) level.

> Skip rows with empty columns when slicing entire row
> ----------------------------------------------------
>
>                 Key: CASSANDRA-2855
>                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2855
>             Project: Cassandra
>          Issue Type: Improvement
>          Components: API
>            Reporter: Jeremy Hanna
>            Assignee: Jonathan Ellis
>            Priority: Minor
>              Labels: hadoop
>             Fix For: 0.8.2
