Posted to dev@mahout.apache.org by "Mat Kelcey (Created) (JIRA)" <ji...@apache.org> on 2011/12/28 07:30:31 UTC

[jira] [Created] (MAHOUT-937) Collocations Job Partitioner not being configured properly

Collocations Job Partitioner not being configured properly
----------------------------------------------------------

                 Key: MAHOUT-937
                 URL: https://issues.apache.org/jira/browse/MAHOUT-937
             Project: Mahout
          Issue Type: Bug
            Reporter: Mat Kelcey
            Priority: Minor


The first pass of the collocations discovery job (as described by CollocDriver.generateCollocations) uses the org.apache.mahout.vectorizer.collocations.llr.GramKeyPartitioner partitioner. 

This partitioner has an instance variable, offset, that is supposed to be set by a call to setOffsets(), but this call is never made (not sure why; is this method expected to be called by the Hadoop framework itself?).

Because the offset is never set, getPartition always returns 0, so all intermediate data is sent to a single reducer.
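
As a rough illustration of the skew (a standalone sketch, not the Mahout code): if the number of bytes fed to the hash is always zero, every key hashes to the same value and therefore lands in the same partition. The hashBytes helper below is a local stand-in written for this example, and the class and method names are hypothetical.

```java
import java.nio.charset.StandardCharsets;
import java.util.HashSet;
import java.util.Set;

final class SingleReducerSkewDemo {

  // Local stand-in for a simple byte hash (an assumption for this sketch,
  // not a copy of any library method).
  static int hashBytes(byte[] bytes, int length) {
    int hash = 1;
    for (int i = 0; i < length; i++) {
      hash = 31 * hash + bytes[i];
    }
    return hash;
  }

  // Partition index when only the first hashedLength bytes feed the hash.
  static int partitionFor(byte[] keyBytes, int hashedLength, int numPartitions) {
    return (hashBytes(keyBytes, hashedLength) & Integer.MAX_VALUE) % numPartitions;
  }

  // Collects the distinct partitions hit by a set of keys.
  static Set<Integer> partitionsUsed(String[] keys, boolean hashWholeKey, int numPartitions) {
    Set<Integer> used = new HashSet<>();
    for (String key : keys) {
      byte[] bytes = key.getBytes(StandardCharsets.UTF_8);
      int hashedLength = hashWholeKey ? bytes.length : 0;
      used.add(partitionFor(bytes, hashedLength, numPartitions));
    }
    return used;
  }

  public static void main(String[] args) {
    String[] keys = {"new york", "hong kong", "machine learning", "san francisco"};
    // With zero bytes hashed, the set holds exactly one partition.
    System.out.println("zero bytes hashed: " + partitionsUsed(keys, false, 8));
    System.out.println("whole key hashed:  " + partitionsUsed(keys, true, 8));
  }
}
```

With a zero-length hash the set of partitions used always has exactly one element, whatever the keys are, which matches the single-reducer behaviour reported here.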

I couldn't quite understand what this partitioning was meant to be doing, but simply hashing the Gram's primary string representation (i.e. without the leading 'type' byte) does what is required:

package org.apache.mahout.vectorizer.collocations.llr;

import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;

public class GramKeyPartitioner extends Partitioner<GramKey, Gram> {

  @Override
  public int getPartition(GramKey key, Gram value, int numPartitions) {
    // Exclude the first byte, which is the key type.
    byte[] keyBytesWithoutTypeByte = new byte[key.getPrimaryLength() - 1];
    System.arraycopy(key.getBytes(), 1, keyBytesWithoutTypeByte, 0, keyBytesWithoutTypeByte.length);
    int hash = WritableComparator.hashBytes(keyBytesWithoutTypeByte, keyBytesWithoutTypeByte.length);
    // Clear the sign bit so the modulo result is a valid partition index.
    return (hash & Integer.MAX_VALUE) % numPartitions;
  }

}
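
The comment about excluding the type byte can be illustrated with a small standalone sketch. One plausible reading (an assumption, since the intent of the original partitioning is not clear from the code) is that keys carrying the same gram text but different type markers should meet at the same reducer. The key layout, the type values 1 and 2, and all names below are hypothetical; hashBytes is a local stand-in.

```java
import java.nio.charset.StandardCharsets;

final class TypeByteExclusionDemo {

  // Hypothetical key layout for this sketch: [type byte][UTF-8 gram text].
  static byte[] encode(byte type, String text) {
    byte[] textBytes = text.getBytes(StandardCharsets.UTF_8);
    byte[] out = new byte[textBytes.length + 1];
    out[0] = type;
    System.arraycopy(textBytes, 0, out, 1, textBytes.length);
    return out;
  }

  // Local stand-in byte hash over bytes[offset, offset + length).
  static int hashBytes(byte[] bytes, int offset, int length) {
    int hash = 1;
    for (int i = offset; i < offset + length; i++) {
      hash = 31 * hash + bytes[i];
    }
    return hash;
  }

  // Hashes everything after the leading type byte.
  static int partitionIgnoringType(byte[] key, int numPartitions) {
    int hash = hashBytes(key, 1, key.length - 1);
    return (hash & Integer.MAX_VALUE) % numPartitions;
  }

  // Hashes the whole key, type byte included, for comparison.
  static int partitionIncludingType(byte[] key, int numPartitions) {
    int hash = hashBytes(key, 0, key.length);
    return (hash & Integer.MAX_VALUE) % numPartitions;
  }

  public static void main(String[] args) {
    byte[] first = encode((byte) 1, "new york");
    byte[] second = encode((byte) 2, "new york");
    // Ignoring the type byte, both keys always map to the same partition.
    System.out.println(partitionIgnoringType(first, 8) == partitionIgnoringType(second, 8));
    // Including it, the two keys may be sent to different partitions.
    System.out.println(partitionIncludingType(first, 8) + " vs " + partitionIncludingType(second, 8));
  }
}
```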




--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Updated] (MAHOUT-937) Collocations Job Partitioner not being configured properly

Posted by "Mat Kelcey (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mat Kelcey updated MAHOUT-937:
------------------------------

    Description: 
The first pass of the collocations discovery job (as described by CollocDriver.generateCollocations) uses the org.apache.mahout.vectorizer.collocations.llr.GramKeyPartitioner partitioner. 

This partitioner has an instance variable, offset, that is supposed to be set by a call to setOffsets(), but this call is never made (not sure why; is this method expected to be called by the Hadoop framework itself?).

Because the offset is never set, getPartition always returns 0, so all intermediate data is sent to a single reducer.

I couldn't quite understand what this partitioning was meant to be doing, but simply hashing the Gram's primary string representation (i.e. without the leading 'type' byte) does what is required:

{code}
package org.apache.mahout.vectorizer.collocations.llr;

import org.apache.hadoop.io.WritableComparator;
import org.apache.hadoop.mapreduce.Partitioner;

public class GramKeyPartitioner extends Partitioner<GramKey, Gram> {

  @Override
  public int getPartition(GramKey key, Gram value, int numPartitions) {
    // Exclude the first byte, which is the key type.
    byte[] keyBytesWithoutTypeByte = new byte[key.getPrimaryLength() - 1];
    System.arraycopy(key.getBytes(), 1, keyBytesWithoutTypeByte, 0, keyBytesWithoutTypeByte.length);
    int hash = WritableComparator.hashBytes(keyBytesWithoutTypeByte, keyBytesWithoutTypeByte.length);
    // Clear the sign bit so the modulo result is a valid partition index.
    return (hash & Integer.MAX_VALUE) % numPartitions;
  }

}
{code}




    


        

[jira] [Updated] (MAHOUT-937) Collocations Job Partitioner not being configured properly

Posted by "Sean Owen (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-937:
-----------------------------

        Fix Version/s: 0.6
             Assignee: Sean Owen
    Affects Version/s: 0.5
               Status: Patch Available  (was: Open)

I agree this looks like it can't be right. The methods aren't called, and the result hashes 0 bytes every time. I think the simplest thing is to avoid copying the byte array altogether; here's my proposed patch, which has the same result. Tests pass. I'll commit soon if there are no objections, since it seems like this can only be a fix, based on the semantics the tests imply.
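
To make the "no copy, same result" point concrete, here is a minimal standalone sketch. It is not the attached patch; hashBytes is a local stand-in, and the class and method names are hypothetical. It hashes the primary bytes after the type byte once via a temporary array and once in place by passing an offset, and the two partition indices agree.

```java
import java.util.Arrays;

final class InPlaceHashDemo {

  // Local stand-in byte hash over bytes[offset, offset + length).
  static int hashBytes(byte[] bytes, int offset, int length) {
    int hash = 1;
    for (int i = offset; i < offset + length; i++) {
      hash = 31 * hash + bytes[i];
    }
    return hash;
  }

  // Variant that copies bytes [1, primaryLength) into a temporary array first.
  static int partitionWithCopy(byte[] keyBytes, int primaryLength, int numPartitions) {
    byte[] withoutType = Arrays.copyOfRange(keyBytes, 1, primaryLength);
    int hash = hashBytes(withoutType, 0, withoutType.length);
    return (hash & Integer.MAX_VALUE) % numPartitions;
  }

  // Variant that hashes the same byte range in place, with no allocation.
  static int partitionInPlace(byte[] keyBytes, int primaryLength, int numPartitions) {
    int hash = hashBytes(keyBytes, 1, primaryLength - 1);
    return (hash & Integer.MAX_VALUE) % numPartitions;
  }

  public static void main(String[] args) {
    // Hypothetical key: type byte, three primary text bytes, then two trailing bytes
    // that lie beyond the primary length and must not affect the hash.
    byte[] key = {1, 'n', 'e', 'w', 9, 9};
    int primaryLength = 4;
    System.out.println(partitionWithCopy(key, primaryLength, 8) == partitionInPlace(key, primaryLength, 8));
  }
}
```

Since a partitioner is invoked once per map output record, skipping the temporary array saves one allocation per record.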
                


        

[jira] [Updated] (MAHOUT-937) Collocations Job Partitioner not being configured properly

Posted by "Sean Owen (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-937:
-----------------------------

    Attachment: MAHOUT-937.patch
    


        

[jira] [Updated] (MAHOUT-937) Collocations Job Partitioner not being configured properly

Posted by "Sean Owen (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-937:
-----------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)
    


        

[jira] [Updated] (MAHOUT-937) Collocations Job Partitioner not being configured properly

Posted by "Mat Kelcey (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mat Kelcey updated MAHOUT-937:
------------------------------

    Attachment:     (was: GramKeyPartitioner.java)
    


        

[jira] [Updated] (MAHOUT-937) Collocations Job Partitioner not being configured properly

Posted by "Mat Kelcey (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-937?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Mat Kelcey updated MAHOUT-937:
------------------------------

    Attachment: GramKeyPartitioner.java
    


        

[jira] [Commented] (MAHOUT-937) Collocations Job Partitioner not being configured properly

Posted by "Mat Kelcey (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13176691#comment-13176691 ] 

Mat Kelcey commented on MAHOUT-937:
-----------------------------------

I should have checked WritableComparator.hashBytes first; for some reason I had it in my head that the hashing was special. Thanks!
                


        

[jira] [Commented] (MAHOUT-937) Collocations Job Partitioner not being configured properly

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-937?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13177031#comment-13177031 ] 

Hudson commented on MAHOUT-937:
-------------------------------

Integrated in Mahout-Quality #1278 (See [https://builds.apache.org/job/Mahout-Quality/1278/])
    MAHOUT-937 make partitioner send to different reducers (as intended it seems) by just using the hash of primary bytes

srowen : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1225420
Files : 
* /mahout/trunk/core/src/main/java/org/apache/mahout/vectorizer/collocations/llr/GramKeyPartitioner.java

                
