Posted to commits@cassandra.apache.org by "Jonathan Ellis (JIRA)" <ji...@apache.org> on 2011/02/17 19:36:24 UTC

[jira] Created: (CASSANDRA-2184) Returning split length of 0 confuses Pig

Returning split length of 0 confuses Pig
----------------------------------------

                 Key: CASSANDRA-2184
                 URL: https://issues.apache.org/jira/browse/CASSANDRA-2184
             Project: Cassandra
          Issue Type: Bug
          Components: Hadoop
    Affects Versions: 0.6
            Reporter: Jonathan Ellis
            Priority: Minor
             Fix For: 0.7.3


Matt Kennedy reports on the user list,

bq. Pig 0.8 has a new feature that tries to reduce the number of splits in order to speed up the whole job.  Because ColumnFamilyInputFormat reports the input size as zero, this feature eliminates all of the splits except one.
bq. The workaround is to disable this feature for jobs that use CassandraStorage by setting -Dpig.splitCombination=false in the pig_cassandra script.

bq. However, we wanted to keep splitCombination on because it is a useful optimization for many of our use cases, so I looked for the least intrusive way to keep the split combiner enabled while preventing it from combining splits that read from Cassandra.  My solution, which you are welcome to critique, is to change line 65 of http://svn.apache.org/viewvc/cassandra/trunk/src/java/org/apache/cassandra/hadoop/ColumnFamilySplit.java so that it returns Long.MAX_VALUE instead of zero.
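
The change described above amounts to a one-line edit in the split's length method. A minimal sketch follows, assuming a stand-in base class (InputSplitStandIn) in place of Hadoop's InputSplit so that it compiles without Hadoop on the classpath. The class names and the token and host fields are hypothetical and do not reproduce the actual ColumnFamilySplit source.

```java
class SplitLengthSketch {
    // Hypothetical stand-in for Hadoop's InputSplit, reduced to the two
    // methods that matter here.
    static abstract class InputSplitStandIn {
        abstract long getLength();
        abstract String[] getLocations();
    }

    // Illustrative split over a token range; the fields are assumptions.
    static class ColumnFamilySplitSketch extends InputSplitStandIn {
        private final String startToken;
        private final String endToken;
        private final String[] hosts;

        ColumnFamilySplitSketch(String startToken, String endToken, String[] hosts) {
            this.startToken = startToken;
            this.endToken = endToken;
            this.hosts = hosts.clone();
        }

        @Override
        long getLength() {
            // Before the change this returned 0, which a size-based split
            // combiner reads as "empty, safe to merge with anything".
            return Long.MAX_VALUE;
        }

        @Override
        String[] getLocations() {
            return hosts.clone();
        }
    }

    public static void main(String[] args) {
        InputSplitStandIn split =
            new ColumnFamilySplitSketch("0", "100", new String[] {"host1"});
        System.out.println("reported length: " + split.getLength());
    }
}
```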

I looked into returning the actual number of keys in the split, but the Hadoop javadoc says "Get the size of the split, so that the input splits can be sorted by size".  Since our splits should be very close in size, it doesn't seem worth an extra round trip to the host servers to get precise numbers.  Returning Long.MAX_VALUE seems good enough.
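
To illustrate why the reported length matters, here is a toy model of a size-based split combiner. It is a simplified sketch, not Pig's actual algorithm: the greedy packing rule, the combine method, and the 128 MB threshold are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.List;

class SplitCombinerSketch {
    /**
     * Greedily packs consecutive splits into groups whose summed reported
     * length stays within maxCombinedSize. Toy model only; assumes all
     * lengths and the threshold are non-negative.
     */
    static List<List<Long>> combine(long[] reportedLengths, long maxCombinedSize) {
        List<List<Long>> groups = new ArrayList<>();
        List<Long> current = new ArrayList<>();
        long currentSize = 0;
        for (long length : reportedLengths) {
            // Overflow-safe form of: currentSize + length <= maxCombinedSize
            boolean fits = length <= maxCombinedSize - currentSize;
            if (!current.isEmpty() && !fits) {
                groups.add(current);
                current = new ArrayList<>();
                currentSize = 0;
            }
            current.add(length);
            // Saturating add so Long.MAX_VALUE lengths cannot wrap around.
            currentSize = (length > Long.MAX_VALUE - currentSize)
                ? Long.MAX_VALUE
                : currentSize + length;
        }
        if (!current.isEmpty()) {
            groups.add(current);
        }
        return groups;
    }

    public static void main(String[] args) {
        long threshold = 128L * 1024 * 1024; // hypothetical combine threshold
        long[] zeroLengths = {0, 0, 0, 0};
        long[] maxLengths = {Long.MAX_VALUE, Long.MAX_VALUE, Long.MAX_VALUE, Long.MAX_VALUE};
        System.out.println("zero-length splits -> groups: "
            + combine(zeroLengths, threshold).size());
        System.out.println("MAX_VALUE splits   -> groups: "
            + combine(maxLengths, threshold).size());
    }
}
```

In this model, four splits that report a length of zero collapse into a single group, while four splits that report Long.MAX_VALUE each stay in their own group, which matches the behavior described in the report.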

-- 
This message is automatically generated by JIRA.
-
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] Updated: (CASSANDRA-2184) Returning split length of 0 confuses Pig

Posted by "Jonathan Ellis (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jonathan Ellis updated CASSANDRA-2184:
--------------------------------------

    Remaining Estimate: 4h
     Original Estimate: 4h



[jira] Resolved: (CASSANDRA-2184) Returning split length of 0 confuses Pig

Posted by "Brandon Williams (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/CASSANDRA-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Brandon Williams resolved CASSANDRA-2184.
-----------------------------------------

    Resolution: Fixed
      Assignee: Brandon Williams



[jira] Commented: (CASSANDRA-2184) Returning split length of 0 confuses Pig

Posted by "Hudson (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12996630#comment-12996630 ] 

Hudson commented on CASSANDRA-2184:
-----------------------------------

Integrated in Cassandra-0.7 #296 (See [https://hudson.apache.org/hudson/job/Cassandra-0.7/296/])
    Change split length from 0 to Long.MAX_VALUE
Patch by Matt Kennedy, reviewed by brandonwilliams for CASSANDRA-2184




[jira] Commented: (CASSANDRA-2184) Returning split length of 0 confuses Pig

Posted by "Brandon Williams (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/CASSANDRA-2184?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12996619#comment-12996619 ] 

Brandon Williams commented on CASSANDRA-2184:
---------------------------------------------

Committed.

