Posted to dev@mahout.apache.org by "Markus Paaso (JIRA)" <ji...@apache.org> on 2012/08/13 08:45:37 UTC

[jira] [Commented] (MAHOUT-1055) Change id fields to use LongWritable instead of IntWritable

    [ https://issues.apache.org/jira/browse/MAHOUT-1055?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13432957#comment-13432957 ] 

Markus Paaso commented on MAHOUT-1055:
--------------------------------------

A workaround for the class compatibility problem is to convert sequence files keyed by LongWritable into files keyed by IntWritable:
{code}
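import java.io.IOException;

import org.apache.hadoop.conf.Configuration;
import org.apache.hadoop.fs.FileSystem;
import org.apache.hadoop.fs.Path;
import org.apache.hadoop.io.IntWritable;
import org.apache.hadoop.io.LongWritable;
import org.apache.hadoop.io.SequenceFile;
import org.apache.mahout.math.VectorWritable;

public static void convertLongWritableToIntWritable(String inPathString, String outPathString) throws IOException {
	Path inPath = new Path(inPathString);
	Configuration conf = new Configuration();
	FileSystem fs = FileSystem.get(conf);
	SequenceFile.Reader sfr = new SequenceFile.Reader(fs, inPath, conf);
	try {
		// Convert only files with LongWritable keys and VectorWritable values.
		if (sfr.getKeyClass() == LongWritable.class && sfr.getValueClass() == VectorWritable.class) {
			Path outPath = new Path(outPathString);
			SequenceFile.Writer seqWriter = SequenceFile.createWriter(
				fs, conf, outPath, IntWritable.class, VectorWritable.class);
			try {
				LongWritable k = new LongWritable();
				VectorWritable v = new VectorWritable();
				while (sfr.next(k, v)) {
					// Check the range while the id is still a long; casting first would wrap silently.
					long l = k.get();
					if (l < Integer.MIN_VALUE || l > Integer.MAX_VALUE) {
						throw new IllegalArgumentException(l + " cannot be cast to int without changing its value.");
					}
					seqWriter.append(new IntWritable((int) l), v);
				}
			} finally {
				seqWriter.close();
			}
		}
	} finally {
		sfr.close();
	}
}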
{code}
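
The range check is the part of this workaround that is easiest to get wrong, so here is a minimal standalone sketch of it that runs without Hadoop or Mahout on the classpath. The class name LongIdNarrowing and the helper toIntId are hypothetical names used only for illustration; they are not part of Mahout. The point it shows is that the comparison has to happen while the id is still a long, because a cast to int wraps around silently.

```java
final class LongIdNarrowing {
    private LongIdNarrowing() {}

    /** Hypothetical helper: narrows a long id to int, failing loudly instead of wrapping. */
    static int toIntId(long id) {
        // Compare while the value is still a long; after a cast the check could never fail.
        if (id < Integer.MIN_VALUE || id > Integer.MAX_VALUE) {
            throw new IllegalArgumentException(id + " cannot be cast to int without changing its value.");
        }
        return (int) id;
    }

    public static void main(String[] args) {
        // An id that fits in an int passes through unchanged.
        System.out.println(toIntId(42L));

        // A plain cast of an out-of-range id wraps around to Integer.MIN_VALUE.
        long tooBig = Integer.MAX_VALUE + 1L;
        System.out.println((int) tooBig);

        // The helper rejects the same id instead of corrupting it.
        try {
            toIntId(tooBig);
        } catch (IllegalArgumentException e) {
            System.out.println("rejected: " + e.getMessage());
        }
    }
}
```

If any id in the input is rejected this way, the file cannot be converted without losing information, and the ids would have to be remapped into the int range by some other means.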
                
> Change id fields to use LongWritable instead of IntWritable
> -----------------------------------------------------------
>
>                 Key: MAHOUT-1055
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-1055
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.7
>            Reporter: Markus Paaso
>
> Why is IntWritable used as the id field type in Mahout CVB (org.apache.mahout.clustering.lda.cvb.CachingCVB0Mapper)?
> Does Long really have such a significant impact on performance?
> Long is much more usable as an id type, and int causes compatibility issues like the one below.
> In the method org.apache.mahout.utils.vectors.lucene.Driver.getSeqFileWriter(), LongWritable is correctly used as the id field type.
> I suggest that every IntWritable id be changed to LongWritable.
> A sequence file produced by the command 'mahout lucene.vector' cannot be handled by the command 'mahout cvb' because of this id type incompatibility.
> See http://mahout.markmail.org/thread/r3m6ojkpbzlxxizy
