Posted to dev@mahout.apache.org by Shannon Quinn <sq...@gatech.edu> on 2013/11/20 22:04:34 UTC

spectral clustering additions [was: Mahout 0.9 release]

On that note, I wanted to ask: what does everyone feel needs to be done 
to make the standard spectral clustering robust enough to be considered 
a core algorithm? For me the biggest item was to have a job that 
computes the pairwise similarities required (I've recently started 
this), and I'd love to know what sort of input formats it should support 
for conversion to a similarity matrix. Is there anything else?
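
To make the pairwise-similarity step concrete, here is a minimal sketch of the idea. It is not Mahout code. It assumes the raw input is a list of numeric vectors, uses a Gaussian kernel as one common choice of similarity, and writes the result as "i,j,value" text triples. The function names, the sigma parameter, and the output format are all illustrative assumptions, not an existing API.

```python
import math


def pairwise_affinity(points, sigma=1.0):
    """Build a dense, symmetric affinity matrix with a Gaussian kernel.

    Assumption: each point is a sequence of floats of equal length.
    The diagonal is left at zero, a common convention in spectral clustering.
    """
    n = len(points)
    affinity = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1, n):
            # Squared Euclidean distance between point i and point j.
            sq_dist = sum((a - b) ** 2 for a, b in zip(points[i], points[j]))
            weight = math.exp(-sq_dist / (2.0 * sigma ** 2))
            # Similarity is symmetric, so fill both halves of the matrix.
            affinity[i][j] = weight
            affinity[j][i] = weight
    return affinity


def to_triples(affinity):
    """Flatten the matrix into 'row,col,value' lines (hypothetical format)."""
    return [
        f"{i},{j},{value}"
        for i, row in enumerate(affinity)
        for j, value in enumerate(row)
    ]


if __name__ == "__main__":
    points = [(0.0, 0.0), (1.0, 0.0), (5.0, 5.0)]
    for line in to_triples(pairwise_affinity(points)):
        print(line)
```

This in-memory version is quadratic in the number of points; a distributed job would need to partition the pairs, which is outside what this sketch shows.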

Eigencuts is another matter; I'm working on streamlining the data 
structures to make that more efficient.

-------- Original Message --------
Subject: 	Re: Mahout 0.9 release
Date: 	Wed, 20 Nov 2013 21:39:18 +0100
From: 	Isabel Drost-Fromm <is...@apache.org>
Reply-To: 	dev@mahout.apache.org
To: 	dev@mahout.apache.org



On Wed, 20 Nov 2013 10:32:42 -0800 (PST)
Suneel Marthi <su...@yahoo.com> wrote:

> We are presently targeting 0.9 for Dec 9.

Speaking of which: Any helping hand (be it on fixing issues, reviewing patches, adding to the documentation) is highly welcome to make this happen! If you are unsure which tasks the project most urgently needs help with, do not be afraid to ask on the mailing list.


Isabel




Re: spectral clustering additions [was: Mahout 0.9 release]

Posted by Isabel Drost-Fromm <is...@apache.org>.
On Thu, 21 Nov 2013 10:39:48 +0100
Isabel Drost-Fromm <is...@apache.org> wrote:

> Do we have some documentation on spectral clustering?

Found some. Can you please go over it and check that everything is still correct, that it was exported completely, and that nothing a user getting started with this would need is missing?

http://mahout.staging.apache.org/users/clustering/spectral-clustering.html

Isabel

Re: spectral clustering additions [was: Mahout 0.9 release]

Posted by Suneel Marthi <su...@yahoo.com>.
Shannon,

Also, the existing Spectral KMeans still refers to the deprecated DistributedLanczosSolver and EigenVerificationJob. It would be nice to fix that for the 0.9 release.





On Thursday, November 21, 2013 4:03 PM, Suneel Marthi <su...@yahoo.com> wrote:
 
On #2, it would be good if you could add Spectral KMeans to examples/bin/cluster-reuters.sh to process the Reuters dataset.





On Thursday, November 21, 2013 3:50 PM, Shannon Quinn <sq...@gatech.edu> wrote:
 
Excellent. My todo list, then:

1: post docs for the algorithm on the Apache CMS
2: create an example to demonstrate how to use it
3: code a job to process raw input into a similarity matrix (will create 
a JIRA for it)

I have a question for #3 that can be a separate thread; mainly, what are 
the primary input formats I should be concerned with processing?


On 11/21/13, 1:09 PM, Isabel Drost-Fromm wrote:
> On Thu, 21 Nov 2013 09:42:28 -0800 (PST)
> Suneel Marthi <su...@yahoo.com> wrote:
>
>> We are missing wiki docs for both Streaming kmeans and Spectral clustering.
>>
>> I can pull something together for streaming kmeans.
>>
>> Speaking of which we need to add a wiki page for Ted's t-digest once we figure out how it plays into Mahout (maybe as a measure of Streaming kmeans clustering, Ted??).
> Given that we are in the process of migrating substantial parts of our wiki to the main website soon to be hosted in Apache CMS it would be great if you could add your content there. See also MAHOUT-1245 and http://markmail.org/thread/5ixlclhlh3acgcoq for some details.
>
> Isabel

Re: spectral clustering additions [was: Mahout 0.9 release]

Posted by Shannon Quinn <sq...@gatech.edu>.
That also gives me at least one answer for #3 :)

On 11/21/13, 4:03 PM, Suneel Marthi wrote:
> On #2, it would be good if you could add Spectral KMeans to examples/bin/cluster-reuters.sh to process the Reuters dataset.
>
>
>
>
>
> On Thursday, November 21, 2013 3:50 PM, Shannon Quinn <sq...@gatech.edu> wrote:
>   
> Excellent. My todo list, then:
>
> 1: post docs for the algorithm on the Apache CMS
> 2: create an example to demonstrate how to use it
> 3: code a job to process raw input into a similarity matrix (will create
> a JIRA for it)
>
> I have a question for #3 that can be a separate thread; mainly, what are
> the primary input formats I should be concerned with processing?
>
>
> On 11/21/13, 1:09 PM, Isabel Drost-Fromm wrote:
>> On Thu, 21 Nov 2013 09:42:28 -0800 (PST)
>> Suneel Marthi <su...@yahoo.com> wrote:
>>
>>> We are missing wiki docs for both Streaming kmeans and Spectral clustering.
>>>
>>> I can pull something together for streaming kmeans.
>>>
>>> Speaking of which we need to add a wiki page for Ted's t-digest once we figure out how it plays into Mahout (maybe as a measure of Streaming kmeans clustering, Ted??).
>> Given that we are in the process of migrating substantial parts of our wiki to the main website soon to be hosted in Apache CMS it would be great if you could add your content there. See also MAHOUT-1245 and http://markmail.org/thread/5ixlclhlh3acgcoq for some details.
>>
>> Isabel


Re: spectral clustering additions [was: Mahout 0.9 release]

Posted by Suneel Marthi <su...@yahoo.com>.
On #2, it would be good if you could add Spectral KMeans to examples/bin/cluster-reuters.sh to process the Reuters dataset.





On Thursday, November 21, 2013 3:50 PM, Shannon Quinn <sq...@gatech.edu> wrote:
 
Excellent. My todo list, then:

1: post docs for the algorithm on the Apache CMS
2: create an example to demonstrate how to use it
3: code a job to process raw input into a similarity matrix (will create 
a JIRA for it)

I have a question for #3 that can be a separate thread; mainly, what are 
the primary input formats I should be concerned with processing?


On 11/21/13, 1:09 PM, Isabel Drost-Fromm wrote:
> On Thu, 21 Nov 2013 09:42:28 -0800 (PST)
> Suneel Marthi <su...@yahoo.com> wrote:
>
>> We are missing wiki docs for both Streaming kmeans and Spectral clustering.
>>
>> I can pull something together for streaming kmeans.
>>
>> Speaking of which we need to add a wiki page for Ted's t-digest once we figure out how it plays into Mahout (maybe as a measure of Streaming kmeans clustering, Ted??).
> Given that we are in the process of migrating substantial parts of our wiki to the main website soon to be hosted in Apache CMS it would be great if you could add your content there. See also MAHOUT-1245 and http://markmail.org/thread/5ixlclhlh3acgcoq for some details.
>
> Isabel

Re: spectral clustering additions [was: Mahout 0.9 release]

Posted by Shannon Quinn <sq...@gatech.edu>.
Excellent. My todo list, then:

1: post docs for the algorithm on the Apache CMS
2: create an example to demonstrate how to use it
3: code a job to process raw input into a similarity matrix (will create 
a JIRA for it)

I have a question for #3 that can be a separate thread; mainly, what are 
the primary input formats I should be concerned with processing?

On 11/21/13, 1:09 PM, Isabel Drost-Fromm wrote:
> On Thu, 21 Nov 2013 09:42:28 -0800 (PST)
> Suneel Marthi <su...@yahoo.com> wrote:
>
>> We are missing wiki docs for both Streaming kmeans and Spectral clustering.
>>
>> I can pull something together for streaming kmeans.
>>
>> Speaking of which we need to add a wiki page for Ted's t-digest once we figure out how it plays into Mahout (maybe as a measure of Streaming kmeans clustering, Ted??).
> Given that we are in the process of migrating substantial parts of our wiki to the main website soon to be hosted in Apache CMS it would be great if you could add your content there. See also MAHOUT-1245 and http://markmail.org/thread/5ixlclhlh3acgcoq for some details.
>
> Isabel


Re: spectral clustering additions [was: Mahout 0.9 release]

Posted by Isabel Drost-Fromm <is...@apache.org>.
On Thu, 21 Nov 2013 09:42:28 -0800 (PST)
Suneel Marthi <su...@yahoo.com> wrote:

> We are missing wiki docs for both Streaming kmeans and Spectral clustering.
> 
> I can pull something together for streaming kmeans.
> 
> Speaking of which we need to add a wiki page for Ted's t-digest once we figure out how it plays into Mahout (maybe as a measure of Streaming kmeans clustering, Ted??).

Given that we are in the process of migrating substantial parts of our wiki to the main website, soon to be hosted in Apache CMS, it would be great if you could add your content there. See also MAHOUT-1245 and http://markmail.org/thread/5ixlclhlh3acgcoq for some details.

Isabel

Re: spectral clustering additions [was: Mahout 0.9 release]

Posted by Suneel Marthi <su...@yahoo.com>.
We are missing wiki docs for both Streaming kmeans and Spectral clustering.

I can pull something together for streaming kmeans.

Speaking of which we need to add a wiki page for Ted's t-digest once we figure out how it plays into Mahout (maybe as a measure of Streaming kmeans clustering, Ted??).





> On Nov 21, 2013, at 4:39 AM, Isabel Drost-Fromm <is...@apache.org> wrote:
> 
> On Wed, 20 Nov 2013 13:29:51 -0800 (PST)
> Suneel Marthi <su...@yahoo.com> wrote:
>> On Spectral clustering, please do add an example to examples/bin/cluster-reuters.sh.
> 
> 
> Do we have some documentation on spectral clustering? When going through what was formerly only in the wiki yesterday, I may have missed it, but I can't remember seeing anything along the lines of "this is roughly how it's implemented", "this is how you use it from the command line" (can be as simple as the --help cmd line output), "this is the type of data and task to use it for" (can be a link to wikipedia) and "this is where in the JavaDoc you should get started if you want to tinker with the code". Maybe the INFRA wiki crawler didn't catch the relevant pages from the wiki?
> 
> Caveat: I'm missing similar information for the work of Dan Filimon. What we have for other implementations is sometimes missing information as well.
> 
> Isabel

Re: spectral clustering additions [was: Mahout 0.9 release]

Posted by Isabel Drost-Fromm <is...@apache.org>.
On Wed, 20 Nov 2013 13:29:51 -0800 (PST)
Suneel Marthi <su...@yahoo.com> wrote:
> On Spectral clustering, please do add an example to examples/bin/cluster-reuters.sh.


Do we have some documentation on spectral clustering? When going through what was formerly only in the wiki yesterday, I may have missed it, but I can't remember seeing anything along the lines of "this is roughly how it's implemented", "this is how you use it from the command line" (can be as simple as the --help cmd line output), "this is the type of data and task to use it for" (can be a link to wikipedia) and "this is where in the JavaDoc you should get started if you want to tinker with the code". Maybe the INFRA wiki crawler didn't catch the relevant pages from the wiki?

Caveat: I'm missing similar information for the work of Dan Filimon. What we have for other implementations is sometimes missing information as well.

Isabel

Re: spectral clustering additions [was: Mahout 0.9 release]

Posted by Shannon Quinn <sq...@gatech.edu>.
Right; I won't propose its re-integration until I'm confident it works 
as advertised. I'm referring to the "vanilla" spectral clustering that's 
still in Mahout.

An example sounds good, will do.

On 11/20/13, 4:29 PM, Suneel Marthi wrote:
> Shannon,
>
> Eigencuts has been deprecated and removed from the present codebase. Do we need to revert that?
>
> On Spectral clustering, please do add an example to examples/bin/cluster-reuters.sh.
>
>
>
>
>
> On Wednesday, November 20, 2013 4:05 PM, Shannon Quinn <sq...@gatech.edu> wrote:
>   
> On that note, I wanted to ask: what does everyone feel needs to be done
> to make the standard spectral clustering robust enough to be considered
> a core algorithm? For me the biggest item was to have a job that
> computes the pairwise similarities required (I've recently started
> this), and I'd love to know what sort of input formats it should support
> for conversion to a similarity matrix. Is there anything else?
>
> Eigencuts is another matter; I'm working on streamlining the data
> structures to make that more efficient.
>
>
> -------- Original Message --------
> Subject:     Re: Mahout 0.9 release
> Date:     Wed, 20 Nov 2013 21:39:18 +0100
> From:     Isabel Drost-Fromm <is...@apache.org>
> Reply-To:     dev@mahout.apache.org
> To:     dev@mahout.apache.org
>
>
>
> On Wed, 20 Nov 2013 10:32:42 -0800 (PST)
> Suneel Marthi <su...@yahoo.com> wrote:
>
>> We are presently targeting 0.9 for Dec 9.
> Speaking of which: Any helping hand (be it on fixing issues, reviewing patches, adding to the documentation) is highly welcome to make this happen! If you are unsure which tasks the project most urgently needs help with, do not be afraid to ask on the mailing list.
>
>
> Isabel


Re: spectral clustering additions [was: Mahout 0.9 release]

Posted by Suneel Marthi <su...@yahoo.com>.
Shannon,

Eigencuts has been deprecated and removed from the present codebase. Do we need to revert that?

On Spectral clustering, please do add an example to examples/bin/cluster-reuters.sh.





On Wednesday, November 20, 2013 4:05 PM, Shannon Quinn <sq...@gatech.edu> wrote:
 
On that note, I wanted to ask: what does everyone feel needs to be done 
to make the standard spectral clustering robust enough to be considered 
a core algorithm? For me the biggest item was to have a job that 
computes the pairwise similarities required (I've recently started 
this), and I'd love to know what sort of input formats it should support 
for conversion to a similarity matrix. Is there anything else?

Eigencuts is another matter; I'm working on streamlining the data 
structures to make that more efficient.


-------- Original Message --------
Subject:     Re: Mahout 0.9 release
Date:     Wed, 20 Nov 2013 21:39:18 +0100
From:     Isabel Drost-Fromm <is...@apache.org>
Reply-To:     dev@mahout.apache.org
To:     dev@mahout.apache.org



On Wed, 20 Nov 2013 10:32:42 -0800 (PST)
Suneel Marthi <su...@yahoo.com> wrote:

> We are presently targeting 0.9 for Dec 9.

Speaking of which: Any helping hand (be it on fixing issues, reviewing patches, adding to the documentation) is highly welcome to make this happen! If you are unsure which tasks the project most urgently needs help with, do not be afraid to ask on the mailing list.


Isabel