You are viewing a plain text version of this content. The canonical link for it is here.
Posted to dev@mahout.apache.org by "Jeff Eastman (JIRA)" <ji...@apache.org> on 2010/09/30 23:27:32 UTC

[jira] Created: (MAHOUT-516) Eigencuts produces unexpected results

Eigencuts produces unexpected results
-------------------------------------

                 Key: MAHOUT-516
                 URL: https://issues.apache.org/jira/browse/MAHOUT-516
             Project: Mahout
          Issue Type: Bug
          Components: Clustering
    Affects Versions: 0.4
            Reporter: Jeff Eastman
             Fix For: 0.5


Shannon reports he suspects a logic error in Eigencuts since it evidently does not produce exactly the expected results. It passes all current unit tests so we need to characterize the results differences and produce a test for it. Marking for 0.5 for now though we will fix it as soon as possible.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


Re: [jira] Commented: (MAHOUT-516) Eigencuts produces unexpected results

Posted by Ted Dunning <te...@gmail.com>.
Sounds more like new requirements requiring new behavior to me.

This is open source.  If you see a way to make the box better, you are free
to do so.

On Fri, Oct 22, 2010 at 5:10 AM, Shannon Quinn <sq...@gatech.edu> wrote:

> It would have to be in the LanczosSolver itself, though - seems like kind
> of a special case to be modifying the black box?
>
>
> On 10/22/2010 2:10 AM, Ted Dunning wrote:
>
>> There isn't a good way to guess eigenvalues.
>>
>> But the basic decomposer can see progressively more eigenvalues as it goes
>> and should be able to know when to stop.
>>
>> I can't speak to where in the code you would stick that, but it should be
>> reasonably easy to find.
>>
>> On Thu, Oct 21, 2010 at 5:33 PM, Shannon Quinn (JIRA)<jira@apache.org
>> >wrote:
>>
>>     [
>>>
>>> https://issues.apache.org/jira/browse/MAHOUT-516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923711#action_12923711
>>> ]
>>>
>>> Shannon Quinn commented on MAHOUT-516:
>>> --------------------------------------
>>>
>>> For the time being, I'm just going to go with a -k type flag for
>>> specifying
>>> the degree of eigendecomposition. But here are my thoughts on a more
>>> permanent solution:
>>>
>>> In reading up a little further on the low-rank approximations of the
>>> eigencuts paper, it appears that, at least for images, the eigenvalues
>>> follow a linear decrease from 1, i.e. each corresponding eigenvalue is<=
>>> the previous according to some approximately linear function. In
>>> perturbing
>>> the flow of probability in the underlying markov transition graph (in
>>> order
>>> to determine where the clusters are), any eigenvalue/eigenvector pairs
>>> that
>>> fall under a certain threshold (specified by a combination of epsilon and
>>> beta, which are command-line arguments) are ignored. Thus, since the
>>> eigenvalues are monotonically decreasing, in theory we'd only need to
>>> find
>>> which eigenvalue falls beneath the threshold and perform a full
>>> decomposition up to that point.
>>>
>>> There's an obvious implementation problem there: we can't really know
>>> what
>>> that minimum degree is without performing a full decomposition in the
>>> first
>>> place. Is there a way around this? Do we have an efficient way of
>>> calculating, or perhaps approximating, eigenvalues without computing
>>> corresponding eigenvectors or otherwise performing a full decomposition?
>>> Maybe we could even do this probabilistically by "sampling" from the
>>> space
>>> of eigenvalues to make a guess on what rank we want? Just throwing ideas
>>> out
>>> here until the experts respond :)
>>>
>>>  Eigencuts produces unexpected results
>>>> -------------------------------------
>>>>
>>>>                 Key: MAHOUT-516
>>>>                 URL: https://issues.apache.org/jira/browse/MAHOUT-516
>>>>             Project: Mahout
>>>>          Issue Type: Bug
>>>>          Components: Clustering
>>>>    Affects Versions: 0.4
>>>>            Reporter: Jeff Eastman
>>>>             Fix For: 0.5
>>>>
>>>>         Attachments: jeastman.vcf
>>>>
>>>>
>>>> Shannon reports he suspects a logic error in Eigencuts since it
>>>> evidently
>>>>
>>> does not produce exactly the expected results. It passes all current unit
>>> tests so we need to characterize the results differences and produce a
>>> test
>>> for it. Marking for 0.5 for now though we will fix it as soon as
>>> possible.
>>>
>>> --
>>> This message is automatically generated by JIRA.
>>> -
>>> You can reply to this email to add a comment to the issue online.
>>>
>>>
>>>
>

Re: [jira] Commented: (MAHOUT-516) Eigencuts produces unexpected results

Posted by Shannon Quinn <sq...@gatech.edu>.
It would have to be in the LanczosSolver itself, though - seems like 
kind of a special case to be modifying the black box?

On 10/22/2010 2:10 AM, Ted Dunning wrote:
> There isn't a good way to guess eigenvalues.
>
> But the basic decomposer can see progressively more eigenvalues as it goes
> and should be able to know when to stop.
>
> I can't speak to where in the code you would stick that, but it should be
> reasonably easy to find.
>
> On Thu, Oct 21, 2010 at 5:33 PM, Shannon Quinn (JIRA)<ji...@apache.org>wrote:
>
>>     [
>> https://issues.apache.org/jira/browse/MAHOUT-516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923711#action_12923711]
>>
>> Shannon Quinn commented on MAHOUT-516:
>> --------------------------------------
>>
>> For the time being, I'm just going to go with a -k type flag for specifying
>> the degree of eigendecomposition. But here are my thoughts on a more
>> permanent solution:
>>
>> In reading up a little further on the low-rank approximations of the
>> eigencuts paper, it appears that, at least for images, the eigenvalues
>> follow a linear decrease from 1, i.e. each corresponding eigenvalue is<=
>> the previous according to some approximately linear function. In perturbing
>> the flow of probability in the underlying markov transition graph (in order
>> to determine where the clusters are), any eigenvalue/eigenvector pairs that
>> fall under a certain threshold (specified by a combination of epsilon and
>> beta, which are command-line arguments) are ignored. Thus, since the
>> eigenvalues are monotonically decreasing, in theory we'd only need to find
>> which eigenvalue falls beneath the threshold and perform a full
>> decomposition up to that point.
>>
>> There's an obvious implementation problem there: we can't really know what
>> that minimum degree is without performing a full decomposition in the first
>> place. Is there a way around this? Do we have an efficient way of
>> calculating, or perhaps approximating, eigenvalues without computing
>> corresponding eigenvectors or otherwise performing a full decomposition?
>> Maybe we could even do this probabilistically by "sampling" from the space
>> of eigenvalues to make a guess on what rank we want? Just throwing ideas out
>> here until the experts respond :)
>>
>>> Eigencuts produces unexpected results
>>> -------------------------------------
>>>
>>>                  Key: MAHOUT-516
>>>                  URL: https://issues.apache.org/jira/browse/MAHOUT-516
>>>              Project: Mahout
>>>           Issue Type: Bug
>>>           Components: Clustering
>>>     Affects Versions: 0.4
>>>             Reporter: Jeff Eastman
>>>              Fix For: 0.5
>>>
>>>          Attachments: jeastman.vcf
>>>
>>>
>>> Shannon reports he suspects a logic error in Eigencuts since it evidently
>> does not produce exactly the expected results. It passes all current unit
>> tests so we need to characterize the results differences and produce a test
>> for it. Marking for 0.5 for now though we will fix it as soon as possible.
>>
>> --
>> This message is automatically generated by JIRA.
>> -
>> You can reply to this email to add a comment to the issue online.
>>
>>


Re: [jira] Commented: (MAHOUT-516) Eigencuts produces unexpected results

Posted by Ted Dunning <te...@gmail.com>.
There isn't a good way to guess eigenvalues.

But the basic decomposer can see progressively more eigenvalues as it goes
and should be able to know when to stop.

I can't speak to where in the code you would stick that, but it should be
reasonably easy to find.

On Thu, Oct 21, 2010 at 5:33 PM, Shannon Quinn (JIRA) <ji...@apache.org>wrote:

>
>    [
> https://issues.apache.org/jira/browse/MAHOUT-516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923711#action_12923711]
>
> Shannon Quinn commented on MAHOUT-516:
> --------------------------------------
>
> For the time being, I'm just going to go with a -k type flag for specifying
> the degree of eigendecomposition. But here are my thoughts on a more
> permanent solution:
>
> In reading up a little further on the low-rank approximations of the
> eigencuts paper, it appears that, at least for images, the eigenvalues
> follow a linear decrease from 1, i.e. each corresponding eigenvalue is <=
> the previous according to some approximately linear function. In perturbing
> the flow of probability in the underlying markov transition graph (in order
> to determine where the clusters are), any eigenvalue/eigenvector pairs that
> fall under a certain threshold (specified by a combination of epsilon and
> beta, which are command-line arguments) are ignored. Thus, since the
> eigenvalues are monotonically decreasing, in theory we'd only need to find
> which eigenvalue falls beneath the threshold and perform a full
> decomposition up to that point.
>
> There's an obvious implementation problem there: we can't really know what
> that minimum degree is without performing a full decomposition in the first
> place. Is there a way around this? Do we have an efficient way of
> calculating, or perhaps approximating, eigenvalues without computing
> corresponding eigenvectors or otherwise performing a full decomposition?
> Maybe we could even do this probabilistically by "sampling" from the space
> of eigenvalues to make a guess on what rank we want? Just throwing ideas out
> here until the experts respond :)
>
> > Eigencuts produces unexpected results
> > -------------------------------------
> >
> >                 Key: MAHOUT-516
> >                 URL: https://issues.apache.org/jira/browse/MAHOUT-516
> >             Project: Mahout
> >          Issue Type: Bug
> >          Components: Clustering
> >    Affects Versions: 0.4
> >            Reporter: Jeff Eastman
> >             Fix For: 0.5
> >
> >         Attachments: jeastman.vcf
> >
> >
> > Shannon reports he suspects a logic error in Eigencuts since it evidently
> does not produce exactly the expected results. It passes all current unit
> tests so we need to characterize the results differences and produce a test
> for it. Marking for 0.5 for now though we will fix it as soon as possible.
>
> --
> This message is automatically generated by JIRA.
> -
> You can reply to this email to add a comment to the issue online.
>
>

[jira] Commented: (MAHOUT-516) Eigencuts produces unexpected results

Posted by "Shannon Quinn (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12917858#action_12917858 ] 

Shannon Quinn commented on MAHOUT-516:
--------------------------------------

Aha. It's not a logic bug per se, but rather a hack I devised (and obviously forgot about): there's no explicit calculation done to determine the desiredRank for the eigen-decomposition, and there is no explicit "formula" for this in the original literature, either. This is an item I'll be asking Dr. Chennubhotla about tomorrow, but needless to say the stopgap measure I implemented during GSoC is nowhere close to scalable: it was to make desiredRank = dimensionality of input data (yes, very bad). We had been discussing the possibility of making this parameter tweak-able by the user (perhaps "suggestable", kind of along the lines of how Hadoop's # of M/R tasks can be suggested by the user), but the upshot is this algorithm still needs a reliable way to determine desiredRank in order to function properly.

> Eigencuts produces unexpected results
> -------------------------------------
>
>                 Key: MAHOUT-516
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-516
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.4
>            Reporter: Jeff Eastman
>             Fix For: 0.5
>
>
> Shannon reports he suspects a logic error in Eigencuts since it evidently does not produce exactly the expected results. It passes all current unit tests so we need to characterize the results differences and produce a test for it. Marking for 0.5 for now though we will fix it as soon as possible.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Updated: (MAHOUT-516) Eigencuts produces unexpected results

Posted by "Jeff Eastman (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-516?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Jeff Eastman updated MAHOUT-516:
--------------------------------

    Attachment: jeastman.vcf

  Great that its not a bug. Sounds like a great candidate for an 
argument; like the k in k-means or the -r on SVD?




> Eigencuts produces unexpected results
> -------------------------------------
>
>                 Key: MAHOUT-516
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-516
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.4
>            Reporter: Jeff Eastman
>             Fix For: 0.5
>
>         Attachments: jeastman.vcf
>
>
> Shannon reports he suspects a logic error in Eigencuts since it evidently does not produce exactly the expected results. It passes all current unit tests so we need to characterize the results differences and produce a test for it. Marking for 0.5 for now though we will fix it as soon as possible.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAHOUT-516) Eigencuts produces unexpected results

Posted by "Shannon Quinn (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12918571#action_12918571 ] 

Shannon Quinn commented on MAHOUT-516:
--------------------------------------

That's definitely an option, and probably the one to go with for the time being.

> Eigencuts produces unexpected results
> -------------------------------------
>
>                 Key: MAHOUT-516
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-516
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.4
>            Reporter: Jeff Eastman
>             Fix For: 0.5
>
>         Attachments: jeastman.vcf
>
>
> Shannon reports he suspects a logic error in Eigencuts since it evidently does not produce exactly the expected results. It passes all current unit tests so we need to characterize the results differences and produce a test for it. Marking for 0.5 for now though we will fix it as soon as possible.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAHOUT-516) Eigencuts produces unexpected results

Posted by "Shannon Quinn (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-516?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12923711#action_12923711 ] 

Shannon Quinn commented on MAHOUT-516:
--------------------------------------

For the time being, I'm just going to go with a -k type flag for specifying the degree of eigendecomposition. But here are my thoughts on a more permanent solution:

In reading up a little further on the low-rank approximations of the eigencuts paper, it appears that, at least for images, the eigenvalues follow a linear decrease from 1, i.e. each corresponding eigenvalue is <= the previous according to some approximately linear function. In perturbing the flow of probability in the underlying markov transition graph (in order to determine where the clusters are), any eigenvalue/eigenvector pairs that fall under a certain threshold (specified by a combination of epsilon and beta, which are command-line arguments) are ignored. Thus, since the eigenvalues are monotonically decreasing, in theory we'd only need to find which eigenvalue falls beneath the threshold and perform a full decomposition up to that point.

There's an obvious implementation problem there: we can't really know what that minimum degree is without performing a full decomposition in the first place. Is there a way around this? Do we have an efficient way of calculating, or perhaps approximating, eigenvalues without computing corresponding eigenvectors or otherwise performing a full decomposition? Maybe we could even do this probabilistically by "sampling" from the space of eigenvalues to make a guess on what rank we want? Just throwing ideas out here until the experts respond :)

> Eigencuts produces unexpected results
> -------------------------------------
>
>                 Key: MAHOUT-516
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-516
>             Project: Mahout
>          Issue Type: Bug
>          Components: Clustering
>    Affects Versions: 0.4
>            Reporter: Jeff Eastman
>             Fix For: 0.5
>
>         Attachments: jeastman.vcf
>
>
> Shannon reports he suspects a logic error in Eigencuts since it evidently does not produce exactly the expected results. It passes all current unit tests so we need to characterize the results differences and produce a test for it. Marking for 0.5 for now though we will fix it as soon as possible.

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.