Posted to dev@mahout.apache.org by Derek O'Callaghan <de...@ucd.ie> on 2010/10/01 10:46:53 UTC

Re: Standard Deviation of a Set of Vectors

Hi Jeff,

Thanks for the info on Canopy. In my case, given that I'm seeing better 
results with the MR version, I'll stick with that for now. I'd also be 
inclined to have the B option for consistency, although I get the 
feeling that not too many people are using the sequential version, so 
perhaps just documenting it is enough for now if there are higher 
priorities for 0.4.

Derek

On 30/09/10 18:31, Jeff Eastman wrote:
>  Derek,
>
> The Canopy implementation was probably one of the first Mahout 
> commits. Its reference implementation performs a single pass over the 
> data and, in your case, produces 128 canopies. It is the correct, 
> published Canopy algorithm. In order to become scalable, the MR 
> version does this in each mapper, and then again in the reducer to 
> combine the results of the mapper canopies. This approach was taken 
> from a Google presentation, iirc, and it seems to produce good 
> results. At least it has withstood the test of time.
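[Editorial note: the single-pass step Jeff describes — each point either joins existing canopies or seeds a new one, governed by two distance thresholds T1 > T2 — can be sketched roughly as below. This is an illustrative Python sketch of the published Canopy idea, not Mahout's actual Java code; the `Canopy` class and threshold names are assumptions.]

```python
import math

class Canopy:
    """Minimal stand-in for a canopy: a center plus its member points."""
    def __init__(self, center):
        self.center = center
        self.points = [center]

def add_point_to_canopies(point, canopies, t1, t2):
    """Single-pass canopy assignment (sketch; t1 > t2).

    The point joins every canopy whose center is within t1 of it.
    If no center is within t2 (the point is not "strongly bound"),
    the point seeds a new canopy of its own.
    """
    strongly_bound = False
    for canopy in canopies:
        d = math.dist(point, canopy.center)
        if d < t1:
            canopy.points.append(point)
        if d < t2:
            strongly_bound = True
    if not strongly_bound:
        canopies.append(Canopy(point))

# One pass over the data: two well-separated groups of points
# yield two canopies.
canopies = []
for p in [(0, 0), (0.5, 0), (10, 10), (10.5, 10), (0.2, 0.1)]:
    add_point_to_canopies(p, canopies, t1=3.0, t2=1.5)
```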
>
> When I added the sequential execution mode to Canopy, I just used the 
> existing reference implementation. Now you have noticed that the 
> results are quite different when running the MR version alongside the 
> sequential version.
>
> I'm not sure which knob to turn here: A) try to modify the MR version 
> to perform a single pass; B) add another pass to the sequential 
> version; or C) just document the difference. A is a hard problem 
> (maybe for 0.5) and B is an easy change (ok for 0.4). Going for the 
> "low-hanging fruit", I'm inclined to do B for consistency.
>
> Can we get some opinions on this from the other Mahouts?
>
> Jeff
>
> PS: On the usability of ClusterEvaluator.intraClusterDensity() (vs. 
> CDbwEvaluator.intraClusterDensity() I presume), I don't have an 
> opinion. Both are pretty experimental IMHO and I'd rather not use 
> "should" for either. It would be interesting to develop some standard 
> data sets against which to compare them both under all of the 
> clustering algorithms. Perhaps a nice wiki page or technical paper for 
> someone to write. I think both evaluators can give useful insight. 
> Again, pick your poison.
>
> On 9/30/10 12:36 PM, Derek O'Callaghan wrote:
>>
>>>> Thanks for the tip, I had been generating the representative points 
>>>> sequentially but was still using the MR versions of the clustering 
>>>> algorithms, I'll change that now.
>>> :)
>>
>> I just tried this, and there seems to be a difference in behaviour 
>> between the sequential and MR versions of Canopy. With MR:
>>
>>    * Mapper called for each point, which calls
>>      canopyClusterer.addPointToCanopies(point.get(), canopies); - in my
>>      case 128 canopies are created
>>    * Reducer called with the canopy centroid points, which then calls
>>      canopyClusterer.addPointToCanopies(point, canopies); for each of
>>      these centroids - and I end up with 11 canopies.
>>
>> And we end up with canopies of canopy centroids.
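[Editorial note: the two-stage flow Derek describes — canopies formed independently in each mapper, then a second canopy pass in the reducer over the mapper centroids — can be mimicked in a toy, non-distributed sketch. The partitioning and `canopy_centers` helper below are illustrative assumptions, not Mahout's implementation; membership within T1 is omitted since only the centers feed the second pass.]

```python
import math

def canopy_centers(points, t1, t2):
    """One simplified canopy pass: a point becomes a new center
    unless it lies within t2 of an existing center. (Tracking of
    t1-membership is omitted; only centers are needed downstream.)"""
    centers = []
    for p in points:
        if not any(math.dist(p, c) < t2 for c in centers):
            centers.append(p)
    return centers

def mr_style_canopy(point_partitions, t1, t2):
    """Mimic the MR flow: each 'mapper' canopies its own partition,
    then a single 'reducer' re-canopies the collected centers --
    producing canopies of canopy centroids."""
    mapper_centers = []
    for partition in point_partitions:   # one "mapper" per partition
        mapper_centers.extend(canopy_centers(partition, t1, t2))
    return canopy_centers(mapper_centers, t1, t2)   # the "reducer" pass

# Two partitions whose near-duplicate points collapse to two centers.
partitions = [[(0, 0), (0.3, 0)], [(0.1, 0), (10, 10)]]
centers = mr_style_canopy(partitions, t1=2.0, t2=1.0)
```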
>>
>> However, the sequential version doesn't appear to have the equivalent 
>> of the Reducer steps, which means that it contains the original 
>> number of canopies. Should it also compute the "canopies of 
>> canopies"? At the moment, the MR version is working much better for 
>> me with the second canopy generation step, so I'll stick with this 
>> for now. I guess it should be consistent between sequential and MR? I 
>> should probably start a separate thread for this...
>>
>>
>>
>>>
>>> I guess I don't quite understand your question. Can you please 
>>> elaborate?
>>>
>>
>> Sorry, what I wanted to ask was: is it okay to use 
>> ClusterEvaluator.intraClusterDensity()? Or should only 
>> ClusterEvaluator.interClusterDensity() be used?
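[Editorial note: for intuition only, the generic distinction behind "intra" vs. "inter" cluster measures can be illustrated with simple distance-based stand-ins — low intra-cluster distance and high inter-centroid distance indicate tight, well-separated clusters. This toy sketch does not reproduce Mahout's ClusterEvaluator or CDbwEvaluator formulas.]

```python
import itertools
import math

def mean_pairwise_distance(points):
    """Average Euclidean distance over all unordered point pairs."""
    pairs = list(itertools.combinations(points, 2))
    return sum(math.dist(a, b) for a, b in pairs) / len(pairs)

def centroid(points):
    """Component-wise mean of a list of equal-length tuples."""
    dims = len(points[0])
    return tuple(sum(p[d] for p in points) / len(points)
                 for d in range(dims))

def intra_cluster_distance(clusters):
    # Average, over clusters, of the mean pairwise distance within each.
    return sum(mean_pairwise_distance(c) for c in clusters) / len(clusters)

def inter_cluster_distance(clusters):
    # Mean pairwise distance between the cluster centroids.
    return mean_pairwise_distance([centroid(c) for c in clusters])

# Two tight clusters far apart: small intra, large inter.
clusters = [[(0, 0), (1, 0)], [(10, 0), (11, 0)]]
```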
>>
>> I have to leave for the evening, but if you need me to check anything 
>> further here re: canopy I can take a look tomorrow.
>>
>

Re: Standard Deviation of a Set of Vectors

Posted by Ted Dunning <te...@gmail.com>.
If there isn't much demand here, I would just document the difference rather
than converge them.

On Fri, Oct 1, 2010 at 1:46 AM, Derek O'Callaghan
<de...@ucd.ie> wrote:

> I'd also be inclined to have the B option for consistency, although I get
> the feeling that not too many people are using the sequential version, so
> perhaps just documenting it is enough for now if there are higher priorities
> for 0.4.
>
> Derek