Posted to common-user@hadoop.apache.org by Grant Ingersoll <gs...@apache.org> on 2009/06/22 20:25:12 UTC

Interfaces/Implementations and Key/Values for M/R

Hi,

Over at Mahout (http://lucene.apache.org/mahout) we have a Vector
interface with two implementations, DenseVector and SparseVector.  When
writing a Mapper/Reducer, we have been able to just use Vector, but when
it comes to actually binding real data via a Configuration, we need to
specify, I think, the actual implementation being used, with something
like:
conf.setOutputValueClass(SparseVector.class);

Ideally, we'd like to put off picking a particular implementation until
as late as possible.  Right now, we've pushed this off to the user, who
passes in the implementation, but even that is less than ideal for a
variety of reasons.  While we typically wouldn't expect the data to be
a mixture of Dense and Sparse, there really shouldn't be a reason why
it can't be.  We realize we could write out the class name to the
DataOutput (we implement Writable), but that means we either have to
hack in some String compares or use Class.forName(), which seems like
it wouldn't perform well (although I admit I haven't tested that yet;
presumably the JDK can cache the info).
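
To make the class-name idea concrete, here is a rough, untested-at-scale
sketch of a single wrapper type that could be registered with the job
while each record carries its own concrete class name.  Everything named
Toy* is hypothetical and is not Mahout's real code, and SimpleWritable is
only a stand-in with the same two methods as Hadoop's Writable so the
sketch compiles without Hadoop on the classpath.  The cache in front of
Class.forName() is an assumption about what might help; I have not
measured it.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.TreeMap;
import java.util.concurrent.ConcurrentHashMap;
import java.util.concurrent.ConcurrentMap;

// Stand-in for org.apache.hadoop.io.Writable (same two methods).
interface SimpleWritable {
    void write(DataOutput out) throws IOException;
    void readFields(DataInput in) throws IOException;
}

// Hypothetical toy vector API, not Mahout's Vector.
interface ToyVector extends SimpleWritable {
    double get(int index);
}

final class ToyDenseVector implements ToyVector {
    private double[] values = new double[0];
    ToyDenseVector() {}  // no-arg constructor needed for reflection
    ToyDenseVector(double[] values) { this.values = values.clone(); }
    public double get(int index) { return values[index]; }
    public void write(DataOutput out) throws IOException {
        out.writeInt(values.length);
        for (double v : values) out.writeDouble(v);
    }
    public void readFields(DataInput in) throws IOException {
        values = new double[in.readInt()];
        for (int i = 0; i < values.length; i++) values[i] = in.readDouble();
    }
}

final class ToySparseVector implements ToyVector {
    private final Map<Integer, Double> entries = new TreeMap<>();
    ToySparseVector() {}  // no-arg constructor needed for reflection
    void set(int index, double value) { entries.put(index, value); }
    public double get(int index) { return entries.getOrDefault(index, 0.0); }
    public void write(DataOutput out) throws IOException {
        out.writeInt(entries.size());
        for (Map.Entry<Integer, Double> e : entries.entrySet()) {
            out.writeInt(e.getKey());
            out.writeDouble(e.getValue());
        }
    }
    public void readFields(DataInput in) throws IOException {
        entries.clear();
        int n = in.readInt();
        for (int i = 0; i < n; i++) entries.put(in.readInt(), in.readDouble());
    }
}

// The one class the job would be configured with; the payload type is per record.
final class ToyVectorWritable implements SimpleWritable {
    // Caches name -> Class so Class.forName() runs once per distinct name.
    private static final ConcurrentMap<String, Class<? extends ToyVector>> CLASS_CACHE =
            new ConcurrentHashMap<>();
    private ToyVector vector;

    ToyVectorWritable() {}
    ToyVectorWritable(ToyVector vector) { this.vector = vector; }
    ToyVector get() { return vector; }

    public void write(DataOutput out) throws IOException {
        out.writeUTF(vector.getClass().getName());  // concrete class name first
        vector.write(out);                          // then the vector's own bytes
    }

    public void readFields(DataInput in) throws IOException {
        String name = in.readUTF();
        try {
            Class<? extends ToyVector> type = CLASS_CACHE.get(name);
            if (type == null) {
                type = Class.forName(name).asSubclass(ToyVector.class);
                CLASS_CACHE.putIfAbsent(name, type);
            }
            vector = type.getDeclaredConstructor().newInstance();
        } catch (ReflectiveOperationException e) {
            throw new IOException("Cannot instantiate " + name, e);
        }
        vector.readFields(in);
    }

    // Serializes and deserializes in memory; handy for checking the sketch.
    static ToyVector roundTrip(ToyVector v) {
        try {
            ByteArrayOutputStream bytes = new ByteArrayOutputStream();
            new ToyVectorWritable(v).write(new DataOutputStream(bytes));
            ToyVectorWritable copy = new ToyVectorWritable();
            copy.readFields(new DataInputStream(new ByteArrayInputStream(bytes.toByteArray())));
            return copy.get();
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }
}
```

With something along these lines the job would only ever name the
wrapper class, and dense and sparse records could be mixed in one file.
The cost is the class name repeated in every record; a one-byte type tag
would be smaller but brings back a hard-coded list of implementations.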

Thanks,
Grant