Posted to dev@mahout.apache.org by Jeff Eastman <jd...@windwardsolutions.com> on 2009/06/16 21:56:44 UTC

MAHOUT-65

You gonna commit your patch? I agree with shortening the class name in 
the JsonVectorAdapter and will do it once you commit ur stuff.
Jeff

Re: MAHOUT-65

Posted by David Hall <dl...@cs.stanford.edu>.
Oh, wow, never mind. Vector implements Writable.

Sorry everyone.

-- David

On Thu, Jun 18, 2009 at 12:19 PM, David Hall<dl...@cs.stanford.edu> wrote:
> actually, it looks like someone went to all the trouble to make both
> SparseVector and DenseVector have all the methods required by
> Writable, but they don't implement Writable.
>
> Could I just make Vector extend Writable?
>
> -- David
>
> On Thu, Jun 18, 2009 at 12:01 PM, David Hall<dl...@cs.stanford.edu> wrote:
>> following up on my earlier email.
>>
>> Would anyone be interested in a "compressed" serialization for
>> DenseVector/SparseVector that follows in the vein of
>> hadoop.io.Writable? The space overhead for gson (parsing issues
>> not-withstanding) is pretty high, and it wouldn't be terribly hard to
>> implement a high-performance thing for vectors.
>>
>> -- David
>>
>> On Tue, Jun 16, 2009 at 1:39 PM, Jeff Eastman<jd...@windwardsolutions.com> wrote:
>>> +1, you added name constructors that I didn't have and the equals/equivalent
>>> stuff. Ya, Gson makes it all pretty trivial once you grok it.
>>>
>>>
>>> Grant Ingersoll wrote:
>>>>
>>>> Shall I take that as approval of the approach?
>>>>
>>>> BTW, the Gson stuff seems like a winner for serialization.
>>>>
>>>> On Jun 16, 2009, at 3:56 PM, Jeff Eastman wrote:
>>>>
>>>>> You gonna commit your patch? I agree with shortening the class name in
>>>>> the JsonVectorAdapter and will do it once you commit ur stuff.
>>>>> Jeff
>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>

Re: MAHOUT-65

Posted by David Hall <dl...@cs.stanford.edu>.
actually, it looks like someone went to all the trouble to make both
SparseVector and DenseVector have all the methods required by
Writable, but they don't implement Writable.

Could I just make Vector extend Writable?

-- David

On Thu, Jun 18, 2009 at 12:01 PM, David Hall<dl...@cs.stanford.edu> wrote:
> following up on my earlier email.
>
> Would anyone be interested in a "compressed" serialization for
> DenseVector/SparseVector that follows in the vein of
> hadoop.io.Writable? The space overhead for gson (parsing issues
> not-withstanding) is pretty high, and it wouldn't be terribly hard to
> implement a high-performance thing for vectors.
>
> -- David
>
> On Tue, Jun 16, 2009 at 1:39 PM, Jeff Eastman<jd...@windwardsolutions.com> wrote:
>> +1, you added name constructors that I didn't have and the equals/equivalent
>> stuff. Ya, Gson makes it all pretty trivial once you grok it.
>>
>>
>> Grant Ingersoll wrote:
>>>
>>> Shall I take that as approval of the approach?
>>>
>>> BTW, the Gson stuff seems like a winner for serialization.
>>>
>>> On Jun 16, 2009, at 3:56 PM, Jeff Eastman wrote:
>>>
>>>> You gonna commit your patch? I agree with shortening the class name in
>>>> the JsonVectorAdapter and will do it once you commit ur stuff.
>>>> Jeff
>>>
>>>
>>>
>>>
>>
>>
>

Re: MAHOUT-65

Posted by David Hall <dl...@cs.stanford.edu>.
On Thu, Jun 18, 2009 at 3:39 PM, Jeff Eastman<jd...@windwardsolutions.com> wrote:
> Shall I change the method to asWritable()?

I'd just be for getting rid of it. Vector implements Writable, so
asWritable() could just be "return this;", which seems gratuitous.
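
To make the point concrete, here is a minimal sketch, not the Mahout source. The nested Writable interface is a stand-in that mirrors the two methods of org.apache.hadoop.io.Writable so the example compiles without Hadoop; the trimmed-down Vector and TinyVector are hypothetical.

```java
import java.io.DataInput;
import java.io.DataOutput;
import java.io.IOException;

class AsWritableSketch {
    /** Stand-in mirroring the two methods of org.apache.hadoop.io.Writable. */
    interface Writable {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    /** Hypothetical, trimmed-down Vector that already extends Writable. */
    interface Vector extends Writable {
        int size();

        // The accessor under discussion: all it can do is hand back the same object.
        default Writable asWritable() {
            return this;
        }
    }

    /** Hypothetical dense implementation, just enough to have an instance. */
    static final class TinyVector implements Vector {
        private double[] values;

        TinyVector(double... values) {
            this.values = values.clone();
        }

        @Override
        public int size() {
            return values.length;
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeInt(values.length);
            for (double v : values) {
                out.writeDouble(v);
            }
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            values = new double[in.readInt()];
            for (int i = 0; i < values.length; i++) {
                values[i] = in.readDouble();
            }
        }
    }

    public static void main(String[] args) {
        Vector v = new TinyVector(1.0, 2.0);
        Writable direct = v;                 // plain upcast, no method call needed
        Writable viaMethod = v.asWritable(); // same reference, one extra call
        System.out.println(direct == viaMethod); // prints "true"
    }
}
```

Since a Vector can be passed anywhere a Writable is expected, the method adds nothing beyond the upcast the compiler already performs.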

As for actual efficiency:
   lucene/mahout/trunk/core/src/main/java/org/apache/mahout/clustering/meanshift/MeanShiftCanopy.java

is currently dumping output values as text strings. If there's a
standard dataset, that would be an easy place to do the test.

- David

> I don't know of any situations where Vectors are used as keys. It hardly
> makes sense to use them as they are so unwieldy. Suggest we could change to
> just Writable and be ahead. In terms of the potential density improvement,
> it will be interesting to see what can typically be achieved.
>
> r786323 just removed all calls to asWritableComparable, replacing them with
> asFormatString which was correct anyway.
>
> Jeff
>
> David Hall wrote:
>>
>> How often does Mahout need the "Comparable" part for Vectors? Are
>> vectors commonly used as map output keys?
>>
>> In terms of space efficiency, I'd bet it's probably a bit better than
>> a factor of two in the average case, especially for densevectors. The
>> gson format is storing both the int index and the double as raw
>> strings, plus whatever boundary characters.  The writable
>> implementation stores just the bytes of the double, plus a length.
>>
>> -- David
>>
>> On Thu, Jun 18, 2009 at 2:13 PM, Jeff Eastman<jd...@windwardsolutions.com>
>> wrote:
>>
>>>
>>> +1 asWritableComparable is a simple implementation that uses
>>> asFormatString.
>>> It would be good to rewrite it for internal communication. A factor of
>>> two
>>> is still a factor of two.
>>>
>>> Jeff
>>>
>>>
>>> Grant Ingersoll wrote:
>>>
>>>>
>>>> On Jun 18, 2009, at 4:45 PM, Ted Dunning wrote:
>>>>
>>>>
>>>>>
>>>>> Writable should be plenty!
>>>>>
>>>>>
>>>>
>>>> +1.  Still nice to have JSON for user facing though.
>>>>
>>>>
>>>>>
>>>>> On Thu, Jun 18, 2009 at 1:15 PM, David Hall <dl...@cs.stanford.edu>
>>>>> wrote:
>>>>>
>>>>>
>>>>>>
>>>>>> See my followup on another thread (sorry for the schizophrenic
>>>>>> posting); Vector already implements Writable, so that's all I really
>>>>>> can ask of it. Is there something more you'd like? I'd be happy to do
>>>>>> it.
>>>>>>
>>>>>>
>>>>>>
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>>
>
>

Re: MAHOUT-65

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
Er, um, I see what you mean. How about just deleting the method? What 
really needs doing then is for all of the various clusters to themselves 
implement Writable so that they don't need to call asFormatString but 
can just emit themselves.
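
A minimal sketch of what "emit themselves" could look like, assuming a cluster is just an id plus a dense center. SimpleCluster, toBytes and fromBytes are hypothetical names, and the nested Writable interface is a stand-in that mirrors the two methods of org.apache.hadoop.io.Writable so the example compiles without Hadoop.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInput;
import java.io.DataInputStream;
import java.io.DataOutput;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;

class ClusterWritableSketch {
    /** Stand-in mirroring the two methods of org.apache.hadoop.io.Writable. */
    interface Writable {
        void write(DataOutput out) throws IOException;
        void readFields(DataInput in) throws IOException;
    }

    /** Hypothetical cluster: an id plus a dense center. */
    static final class SimpleCluster implements Writable {
        private int id;
        private double[] center;

        SimpleCluster(int id, double[] center) {
            this.id = id;
            this.center = center.clone();
        }

        int getId() {
            return id;
        }

        double[] getCenter() {
            return center.clone();
        }

        @Override
        public void write(DataOutput out) throws IOException {
            out.writeInt(id);
            out.writeInt(center.length);
            for (double c : center) {
                out.writeDouble(c);
            }
        }

        @Override
        public void readFields(DataInput in) throws IOException {
            id = in.readInt();
            center = new double[in.readInt()];
            for (int i = 0; i < center.length; i++) {
                center[i] = in.readDouble();
            }
        }
    }

    /** Serializes any Writable into a byte array. */
    static byte[] toBytes(Writable w) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            w.write(out);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    /** Rebuilds a cluster from bytes produced by toBytes. */
    static SimpleCluster fromBytes(byte[] bytes) {
        SimpleCluster c = new SimpleCluster(0, new double[0]);
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            c.readFields(in);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return c;
    }

    public static void main(String[] args) {
        SimpleCluster original = new SimpleCluster(7, new double[] {1.5, -2.0});
        byte[] bytes = toBytes(original);
        SimpleCluster copy = fromBytes(bytes);
        System.out.println(bytes.length + " bytes, id=" + copy.getId());
    }
}
```

With something along these lines, a mapper or reducer could pass the cluster object itself as the output value instead of formatting it to a string first.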
Jeff




Ted Dunning wrote:
> What does this method do?
>
> If the vector already implements Writable, what is the purpose of a
> conversion?
>
> On Thu, Jun 18, 2009 at 3:39 PM, Jeff Eastman <jd...@windwardsolutions.com> wrote:
>
>   
>> Shall I change the method to asWritable()?
>>     
>
>
>
>
>   


Re: MAHOUT-65

Posted by Ted Dunning <te...@gmail.com>.
What does this method do?

If the vector already implements Writable, what is the purpose of a
conversion?

On Thu, Jun 18, 2009 at 3:39 PM, Jeff Eastman <jd...@windwardsolutions.com> wrote:

> Shall I change the method to asWritable()?




-- 
Ted Dunning, CTO
DeepDyve

Re: MAHOUT-65

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
I don't know of any situations where Vectors are used as keys. It hardly 
makes sense to use them as they are so unwieldy. Suggest we could change 
to just Writable and be ahead. In terms of the potential density 
improvement, it will be interesting to see what can typically be achieved.

r786323 just removed all calls to asWritableComparable, replacing them 
with asFormatString which was correct anyway.

Shall I change the method to asWritable()?

Jeff

David Hall wrote:
> How often does Mahout need the "Comparable" part for Vectors? Are
> vectors commonly used as map output keys?
>
> In terms of space efficiency, I'd bet it's probably a bit better than
> a factor of two in the average case, especially for densevectors. The
> gson format is storing both the int index and the double as raw
> strings, plus whatever boundary characters.  The writable
> implementation stores just the bytes of the double, plus a length.
>
> -- David
>
> On Thu, Jun 18, 2009 at 2:13 PM, Jeff Eastman<jd...@windwardsolutions.com> wrote:
>   
>> +1 asWritableComparable is a simple implementation that uses asFormatString.
>> It would be good to rewrite it for internal communication. A factor of two
>> is still a factor of two.
>>
>> Jeff
>>
>>
>> Grant Ingersoll wrote:
>>     
>>> On Jun 18, 2009, at 4:45 PM, Ted Dunning wrote:
>>>
>>>       
>>>> Writable should be plenty!
>>>>
>>>>         
>>> +1.  Still nice to have JSON for user facing though.
>>>
>>>       
>>>> On Thu, Jun 18, 2009 at 1:15 PM, David Hall <dl...@cs.stanford.edu> wrote:
>>>>
>>>>         
>>>>> See my followup on another thread (sorry for the schizophrenic
>>>>> posting); Vector already implements Writable, so that's all I really
>>>>> can ask of it. Is there something more you'd like? I'd be happy to do
>>>>> it.
>>>>>
>>>>>
>>>>>           
>>>
>>>
>>>       
>>     
>
>
>   


Re: MAHOUT-65

Posted by David Hall <dl...@cs.stanford.edu>.
How often does Mahout need the "Comparable" part for Vectors? Are
vectors commonly used as map output keys?

In terms of space efficiency, I'd bet it's probably a bit better than
a factor of two in the average case, especially for DenseVectors. The
Gson format stores both the int index and the double as raw
strings, plus whatever boundary characters. The Writable
implementation stores just the bytes of the double, plus a length.
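
A rough way to check that estimate, as a sketch only: the text encoder below is a hand-rolled "{index:value,...}" format that approximates a JSON-like layout, not Gson's actual output, and the binary encoder assumes the layout described above (a 4-byte length, then 8 raw bytes per double). All names are hypothetical.

```java
import java.io.ByteArrayOutputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.charset.StandardCharsets;
import java.util.Random;

class EncodingSizeSketch {
    /** Binary form: a 4-byte length followed by 8 raw bytes per double. */
    static byte[] binary(double[] values) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeInt(values.length);
            for (double v : values) {
                out.writeDouble(v);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    /** Hand-rolled text form "{index:value,...}"; an approximation, not Gson's output. */
    static byte[] text(double[] values) {
        StringBuilder sb = new StringBuilder("{");
        for (int i = 0; i < values.length; i++) {
            if (i > 0) {
                sb.append(',');
            }
            sb.append(i).append(':').append(values[i]);
        }
        return sb.append('}').toString().getBytes(StandardCharsets.UTF_8);
    }

    public static void main(String[] args) {
        // Random doubles usually print with many significant digits.
        double[] values = new double[1000];
        Random rnd = new Random(42);
        for (int i = 0; i < values.length; i++) {
            values[i] = rnd.nextDouble();
        }
        System.out.println("binary: " + binary(values).length + " bytes");
        System.out.println("text:   " + text(values).length + " bytes");
    }
}
```

The ratio depends heavily on the data. Values that print with 15 or more digits favor the binary form, while short values can go the other way: under this text format {0.1, 0.25} takes 14 bytes as text and 20 as binary. Measuring on a real dataset is the only reliable answer.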

-- David

On Thu, Jun 18, 2009 at 2:13 PM, Jeff Eastman<jd...@windwardsolutions.com> wrote:
> +1 asWritableComparable is a simple implementation that uses asFormatString.
> It would be good to rewrite it for internal communication. A factor of two
> is still a factor of two.
>
> Jeff
>
>
> Grant Ingersoll wrote:
>>
>> On Jun 18, 2009, at 4:45 PM, Ted Dunning wrote:
>>
>>> Writable should be plenty!
>>>
>>
>> +1.  Still nice to have JSON for user facing though.
>>
>>> On Thu, Jun 18, 2009 at 1:15 PM, David Hall <dl...@cs.stanford.edu> wrote:
>>>
>>>> See my followup on another thread (sorry for the schizophrenic
>>>> posting); Vector already implements Writable, so that's all I really
>>>> can ask of it. Is there something more you'd like? I'd be happy to do
>>>> it.
>>>>
>>>>
>>
>>
>>
>>
>
>

Re: MAHOUT-65

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
+1 asWritableComparable is a simple implementation that uses 
asFormatString. It would be good to rewrite it for internal 
communication. A factor of two is still a factor of two.

Jeff


Grant Ingersoll wrote:
>
> On Jun 18, 2009, at 4:45 PM, Ted Dunning wrote:
>
>> Writable should be plenty!
>>
>
> +1.  Still nice to have JSON for user facing though.
>
>> On Thu, Jun 18, 2009 at 1:15 PM, David Hall <dl...@cs.stanford.edu> 
>> wrote:
>>
>>> See my followup on another thread (sorry for the schizophrenic
>>> posting); Vector already implements Writable, so that's all I really
>>> can ask of it. Is there something more you'd like? I'd be happy to do
>>> it.
>>>
>>>
>
>
>
>


Re: MAHOUT-65

Posted by Grant Ingersoll <gs...@apache.org>.
On Jun 18, 2009, at 4:45 PM, Ted Dunning wrote:

> Writable should be plenty!
>

+1.  Still nice to have JSON for user facing though.

> On Thu, Jun 18, 2009 at 1:15 PM, David Hall <dl...@cs.stanford.edu>  
> wrote:
>
>> See my followup on another thread (sorry for the schizophrenic
>> posting); Vector already implements Writable, so that's all I really
>> can ask of it. Is there something more you'd like? I'd be happy to do
>> it.
>>
>>



Re: MAHOUT-65

Posted by Ted Dunning <te...@gmail.com>.
Writable should be plenty!

On Thu, Jun 18, 2009 at 1:15 PM, David Hall <dl...@cs.stanford.edu> wrote:

> See my followup on another thread (sorry for the schizophrenic
> posting); Vector already implements Writable, so that's all I really
> can ask of it. Is there something more you'd like? I'd be happy to do
> it.
>
>

Re: MAHOUT-65

Posted by David Hall <dl...@cs.stanford.edu>.
See my followup on another thread (sorry for the schizophrenic
posting); Vector already implements Writable, so that's all I really
can ask of it. Is there something more you'd like? I'd be happy to do
it.

-- David

On Thu, Jun 18, 2009 at 1:11 PM, Ted Dunning<te...@gmail.com> wrote:
> +10!!!
>
> How would you like to do it?  Something like avro?  Thrift?  Homespun?
>
> On Thu, Jun 18, 2009 at 12:01 PM, David Hall <dl...@cs.stanford.edu> wrote:
>
>> Would anyone be interested in a "compressed" serialization for
>> DenseVector/SparseVector that follows in the vein of
>> hadoop.io.Writable? The space overhead for gson (parsing issues
>> not-withstanding) is pretty high, and it wouldn't be terribly hard to
>> implement a high-performance thing for vectors.
>>
>

Re: MAHOUT-65

Posted by Ted Dunning <te...@gmail.com>.
+10!!!

How would you like to do it?  Something like avro?  Thrift?  Homespun?

On Thu, Jun 18, 2009 at 12:01 PM, David Hall <dl...@cs.stanford.edu> wrote:

> Would anyone be interested in a "compressed" serialization for
> DenseVector/SparseVector that follows in the vein of
> hadoop.io.Writable? The space overhead for gson (parsing issues
> not-withstanding) is pretty high, and it wouldn't be terribly hard to
> implement a high-performance thing for vectors.
>

Re: MAHOUT-65

Posted by David Hall <dl...@cs.stanford.edu>.
following up on my earlier email.

Would anyone be interested in a "compressed" serialization for
DenseVector/SparseVector that follows in the vein of
hadoop.io.Writable? The space overhead for Gson (parsing issues
notwithstanding) is pretty high, and it wouldn't be terribly hard to
implement a high-performance serializer for vectors.
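
For the sparse case, one possible layout is sketched below; it is an assumption for illustration, not Mahout's actual format. It writes the cardinality, the number of non-zero entries, and then one (int index, double value) pair per non-zero entry. The class and method names are hypothetical, and a SortedMap stands in for a real sparse vector.

```java
import java.io.ByteArrayInputStream;
import java.io.ByteArrayOutputStream;
import java.io.DataInputStream;
import java.io.DataOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Map;
import java.util.SortedMap;
import java.util.TreeMap;

class SparseEncodingSketch {
    /** Layout: cardinality, count of non-zero entries, then (index, value) pairs. */
    static byte[] encode(int cardinality, SortedMap<Integer, Double> nonZeros) {
        ByteArrayOutputStream buf = new ByteArrayOutputStream();
        try (DataOutputStream out = new DataOutputStream(buf)) {
            out.writeInt(cardinality);
            out.writeInt(nonZeros.size());
            for (Map.Entry<Integer, Double> e : nonZeros.entrySet()) {
                out.writeInt(e.getKey());
                out.writeDouble(e.getValue());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return buf.toByteArray();
    }

    /** Reads the non-zero entries back; the cardinality is read and discarded here. */
    static SortedMap<Integer, Double> decode(byte[] bytes) {
        SortedMap<Integer, Double> nonZeros = new TreeMap<>();
        try (DataInputStream in = new DataInputStream(new ByteArrayInputStream(bytes))) {
            in.readInt(); // cardinality, unused in this sketch
            int count = in.readInt();
            for (int i = 0; i < count; i++) {
                int index = in.readInt();
                nonZeros.put(index, in.readDouble());
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return nonZeros;
    }

    public static void main(String[] args) {
        SortedMap<Integer, Double> nonZeros = new TreeMap<>();
        nonZeros.put(3, 1.5);
        nonZeros.put(999, -2.0);
        byte[] bytes = encode(1_000_000, nonZeros);
        System.out.println(bytes.length + " bytes for " + nonZeros.size() + " non-zero entries");
        System.out.println(decode(bytes).equals(nonZeros));
    }
}
```

Under this layout the size is 8 bytes of header plus 12 bytes per non-zero entry, independent of the vector's cardinality.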

-- David

On Tue, Jun 16, 2009 at 1:39 PM, Jeff Eastman<jd...@windwardsolutions.com> wrote:
> +1, you added name constructors that I didn't have and the equals/equivalent
> stuff. Ya, Gson makes it all pretty trivial once you grok it.
>
>
> Grant Ingersoll wrote:
>>
>> Shall I take that as approval of the approach?
>>
>> BTW, the Gson stuff seems like a winner for serialization.
>>
>> On Jun 16, 2009, at 3:56 PM, Jeff Eastman wrote:
>>
>>> You gonna commit your patch? I agree with shortening the class name in
>>> the JsonVectorAdapter and will do it once you commit ur stuff.
>>> Jeff
>>
>>
>>
>>
>
>

Re: MAHOUT-65

Posted by Jeff Eastman <jd...@windwardsolutions.com>.
+1, you added name constructors that I didn't have and the 
equals/equivalent stuff. Ya, Gson makes it all pretty trivial once you 
grok it.


Grant Ingersoll wrote:
> Shall I take that as approval of the approach?
>
> BTW, the Gson stuff seems like a winner for serialization.
>
> On Jun 16, 2009, at 3:56 PM, Jeff Eastman wrote:
>
>> You gonna commit your patch? I agree with shortening the class name 
>> in the JsonVectorAdapter and will do it once you commit ur stuff.
>> Jeff
>
>
>
>


Re: MAHOUT-65

Posted by Grant Ingersoll <gs...@apache.org>.
Shall I take that as approval of the approach?

BTW, the Gson stuff seems like a winner for serialization.

On Jun 16, 2009, at 3:56 PM, Jeff Eastman wrote:

> You gonna commit your patch? I agree with shortening the class name  
> in the JsonVectorAdapter and will do it once you commit ur stuff.
> Jeff