Posted to dev@mahout.apache.org by "Sean Owen (Created) (JIRA)" <ji...@apache.org> on 2011/12/02 17:47:39 UTC

[jira] [Created] (MAHOUT-910) Improve sampling in SamplingCandidateItemStrategy, optimize intersection computations

Improve sampling in SamplingCandidateItemStrategy, optimize intersection computations
-------------------------------------------------------------------------------------

                 Key: MAHOUT-910
                 URL: https://issues.apache.org/jira/browse/MAHOUT-910
             Project: Mahout
          Issue Type: Improvement
          Components: Collaborative Filtering
    Affects Versions: 0.5
            Reporter: Sean Owen
            Assignee: Sean Owen
             Fix For: 0.6


Per the lengthy discussion on the mailing list about optimizing SamplingCandidateItemStrategy and related code, I'm opening this placeholder issue.

--
This message is automatically generated by JIRA.
If you think it was sent incorrectly, please contact your JIRA administrators: https://issues.apache.org/jira/secure/ContactAdministrators!default.jspa
For more information on JIRA, see: http://www.atlassian.com/software/jira

        

[jira] [Commented] (MAHOUT-910) Improve sampling in SamplingCandidateItemStrategy, optimize intersection computations

Posted by "Sean Owen (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13162901#comment-13162901 ] 

Sean Owen commented on MAHOUT-910:
----------------------------------

I agree. Since we have three samplings here, the simplest thing is to expose three settings to control each of them individually.
Right now one setting is exposed, as two numbers. I could just clone (and rename) that pair of parameters so that they can be tuned individually. I could keep the 2-arg constructor and have it set all 3 pairs of parameters to the single pair given, to preserve a bit of backward-compatibility. (Behavior is still quite different though.)

I think we could endlessly add hook after hook and strategies inside strategies; at some point I'd probably say it's easier for you to maintain your own copy with particular behavior changes if you really need them. At the least, it'd be better to see that a change opens up support for a broad class of use cases.
                

[jira] [Commented] (MAHOUT-910) Improve sampling in SamplingCandidateItemStrategy, optimize intersection computations

Posted by "Sean Owen (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13163085#comment-13163085 ] 

Sean Owen commented on MAHOUT-910:
----------------------------------

It's still computing some maximum (for each of three different things) and allowing everything if the number of things is less than the max, and only sampling if it exceeds the max. I think it's the same idea as before in this regard. Or are you questioning the default 'factor'? I picked 5. Right now if you have about 10,000 items, it will sample when a user exceeds 5*ln(10000) ~= 46 items.
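A minimal sketch of the arithmetic in the comment above, in Java. The class and method names (LogSamplingLimit, maxFor, shouldSample) are hypothetical and are not taken from the patch; only the factor of 5 and the 10,000-item example come from the comment itself.

```java
final class LogSamplingLimit {

  // Hypothetical helper: a limit of the form factor * ln(n), truncated to an int.
  static int maxFor(double factor, int n) {
    return (int) (factor * Math.log(n));
  }

  // Everything is allowed up to the limit; sampling starts only above it.
  static boolean shouldSample(int count, int max) {
    return count > max;
  }

  public static void main(String[] args) {
    int max = maxFor(5.0, 10_000);
    System.out.println(max);                    // prints 46
    System.out.println(shouldSample(40, max));  // prints false
    System.out.println(shouldSample(100, max)); // prints true
  }
}
```

Because the limit grows with ln(n), doubling the number of items raises it only by about 5 * ln(2), roughly 3.5, when the factor is 5.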
                

[jira] [Commented] (MAHOUT-910) Improve sampling in SamplingCandidateItemStrategy, optimize intersection computations

Posted by "Daniel Zohar (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13164227#comment-13164227 ] 

Daniel Zohar commented on MAHOUT-910:
-------------------------------------

I think the latest implementation looks great. It gives very fine control over the sampling.
However I do have to agree with Ted. In my code, I had to create a hack for users with few items. Otherwise I would get very little to no recommendations.
                

[jira] [Updated] (MAHOUT-910) Improve sampling in SamplingCandidateItemStrategy, optimize intersection computations

Posted by "Sean Owen (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-910:
-----------------------------

    Attachment: SamplingCandidateItemsStrategy.java

It changed so much it might be easier to read the new source file. Here I've gone another way: three parameters to control, but they're all just the "factor" from Sebastian's previous code. It lets you specify limits of the form f*log(n), so limits are logarithmic in the number of items/users. How's this?

There's no overall cap since it is determined by these three values.
                

[jira] [Updated] (MAHOUT-910) Improve sampling in SamplingCandidateItemStrategy, optimize intersection computations

Posted by "Sean Owen (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-910:
-----------------------------

    Attachment: MAHOUT-910.patch

This is what I'm proposing to increase sample-ability. Now sampling applies also to items per user, not just users.
                

[jira] [Commented] (MAHOUT-910) Improve sampling in SamplingCandidateItemStrategy, optimize intersection computations

Posted by "Sebastian Schelter (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13164370#comment-13164370 ] 

Sebastian Schelter commented on MAHOUT-910:
-------------------------------------------

That would be great. I'd also suggest that only sampling the items per user should be the default behaviour.
                

[jira] [Commented] (MAHOUT-910) Improve sampling in SamplingCandidateItemStrategy, optimize intersection computations

Posted by "Ted Dunning (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13163080#comment-13163080 ] 

Ted Dunning commented on MAHOUT-910:
------------------------------------

Sounds like this will down-sample at least some items from users with few items.

I would recommend not eliminating any items from users with a small number of items and sampling to a hard limit above that.
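One way to read this recommendation, as a hedged Java sketch: keep every item when a user has no more than some hard limit, and draw a uniform random subset of exactly that size otherwise. The names CappedSampler, sampleToLimit and hardLimit are illustrative assumptions, not Mahout APIs, and the partial Fisher-Yates shuffle is just one possible sampling method; hardLimit is assumed to be non-negative.

```java
import java.util.Arrays;
import java.util.Random;

final class CappedSampler {

  // Returns all item IDs when there are at most hardLimit of them;
  // otherwise returns a uniform random subset of exactly hardLimit IDs.
  static long[] sampleToLimit(long[] itemIDs, int hardLimit, Random random) {
    if (itemIDs.length <= hardLimit) {
      return itemIDs.clone(); // small users lose nothing
    }
    long[] copy = itemIDs.clone();
    // Partial Fisher-Yates shuffle: only the first hardLimit slots are fixed.
    for (int i = 0; i < hardLimit; i++) {
      int j = i + random.nextInt(copy.length - i);
      long tmp = copy[i];
      copy[i] = copy[j];
      copy[j] = tmp;
    }
    return Arrays.copyOf(copy, hardLimit);
  }

  public static void main(String[] args) {
    Random random = new Random();
    long[] few = {1L, 2L, 3L};
    long[] many = new long[1000];
    for (int i = 0; i < many.length; i++) {
      many[i] = i;
    }
    System.out.println(sampleToLimit(few, 50, random).length);  // prints 3
    System.out.println(sampleToLimit(many, 50, random).length); // prints 50
  }
}
```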
                

[jira] [Issue Comment Edited] (MAHOUT-910) Improve sampling in SamplingCandidateItemStrategy, optimize intersection computations

Posted by "Daniel Zohar (Issue Comment Edited) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13164227#comment-13164227 ] 

Daniel Zohar edited comment on MAHOUT-910 at 12/7/11 8:50 AM:
--------------------------------------------------------------

I think the latest implementation looks great. It gives very fine control over the sampling.
However I do have to agree with Ted. In my code, I had to create a hack for users with few items. Otherwise they would get very little to no recommendations.
                
      was (Author: danielz):
    I think the latest implementation looks great. It gives very fine control over the sampling.
However I do have to agree with Ted. In my code, I had to create a hack for users with few items. Otherwise I would get very little to no recommendations.
                  

[jira] [Commented] (MAHOUT-910) Improve sampling in SamplingCandidateItemStrategy, optimize intersection computations

Posted by "Lance Norskog (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161952#comment-13161952 ] 

Lance Norskog commented on MAHOUT-910:
--------------------------------------

{code}
return userIDs1.size() < userIDs2.size() ?
   userIDs2.intersectionSize(userIDs1) :
   userIDs1.intersectionSize(userIDs2);
{code}
Could this optimization be pushed into FastIDSet?
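For illustration only, here is what moving the size check inside the set operation could look like, sketched with plain java.util.Set<Long> rather than FastIDSet. The class IntersectionSizes and its method are hypothetical; the point is only that the loop runs over the smaller set and probes the larger one, so callers no longer need the ternary.

```java
import java.util.Arrays;
import java.util.HashSet;
import java.util.Set;

final class IntersectionSizes {

  // Counts common elements by iterating the smaller set and probing the larger.
  static int intersectionSize(Set<Long> a, Set<Long> b) {
    Set<Long> smaller = a.size() <= b.size() ? a : b;
    Set<Long> larger = (smaller == a) ? b : a;
    int count = 0;
    for (Long id : smaller) {
      if (larger.contains(id)) {
        count++;
      }
    }
    return count;
  }

  public static void main(String[] args) {
    Set<Long> userIDs1 = new HashSet<>(Arrays.asList(1L, 2L, 3L));
    Set<Long> userIDs2 = new HashSet<>(Arrays.asList(2L, 3L, 4L, 5L));
    System.out.println(intersectionSize(userIDs1, userIDs2)); // prints 2
  }
}
```

With hash-based sets the cost is proportional to the size of the smaller set, which matters when one set is much larger than the other.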

                

[jira] [Commented] (MAHOUT-910) Improve sampling in SamplingCandidateItemStrategy, optimize intersection computations

Posted by "Sean Owen (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13162899#comment-13162899 ] 

Sean Owen commented on MAHOUT-910:
----------------------------------

Daniel says:

Hi Sean,
I have been playing around with your patch. It looks good.
From the little testing I did, I can also say that the recommendations seem
to be more accurate than in my initial proposal (#4).

I just have one suggestion though. I think the current parameters (int
defaultMaxPrefsPerItemConsidered, int userItemCountMultiplier) are not so
clear and don't give enough control over the sampling.
I would introduce two other parameters (it won't be backwards-compatible
though) -
- maxSourcePrefsConsidered: which will be used
in conjunction with SamplingLongPrimitiveIterator to do #1.
- maxFinalPrefs : which will set the value for 'int max' in your patch
(i.e. get rid of max = (int) Math.max(defaultMaxPrefsPerItemConsidered,
userItemCountMultiplier * Math.log(Math.max(dataModel.getNumUsers(),
dataModel.getNumItems()))); )

In the future it would be possible to add a strategy that will affect the
way maxSourcePrefsConsidered is sampled. For example, most recent items or
least recent items or random sampling (like we have now). Even though that
might not be the place to do so.. (since it's not in the context of the
user)

What do you think?
                

[jira] [Updated] (MAHOUT-910) Improve sampling in SamplingCandidateItemStrategy, optimize intersection computations

Posted by "Sean Owen (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-910:
-----------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)
    

[jira] [Commented] (MAHOUT-910) Improve sampling in SamplingCandidateItemStrategy, optimize intersection computations

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13164392#comment-13164392 ] 

Hudson commented on MAHOUT-910:
-------------------------------

Integrated in Mahout-Quality #1233 (See [https://builds.apache.org/job/Mahout-Quality/1233/])
    MAHOUT-910 revamp sampling candidate strategy to expose sampling of items, items' users, and those users' items

srowen : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1211377
Files : 
* /mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/impl/recommender/SamplingCandidateItemsStrategy.java
* /mahout/trunk/core/src/test/java/org/apache/mahout/cf/taste/impl/recommender/SamplingCandidateItemsStrategyTest.java

                

[jira] [Commented] (MAHOUT-910) Improve sampling in SamplingCandidateItemStrategy, optimize intersection computations

Posted by "Sean Owen (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13162086#comment-13162086 ] 

Sean Owen commented on MAHOUT-910:
----------------------------------

Yeah you could; in a few cases the caller already knows the size or it can be included in another if statement, so I left it out.
                

[jira] [Commented] (MAHOUT-910) Improve sampling in SamplingCandidateItemStrategy, optimize intersection computations

Posted by "Daniel Zohar (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13164253#comment-13164253 ] 

Daniel Zohar commented on MAHOUT-910:
-------------------------------------

Ok, I see what you mean. Then I find the solution very suitable (at least for the problem I had.. :)
Do you plan to commit it anytime soon?
                

[jira] [Updated] (MAHOUT-910) Improve sampling in SamplingCandidateItemStrategy, optimize intersection computations

Posted by "Sean Owen (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-910:
-----------------------------

    Attachment: MAHOUT-910.patch

See comments on next file.
                

[jira] [Commented] (MAHOUT-910) Improve sampling in SamplingCandidateItemStrategy, optimize intersection computations

Posted by "Sebastian Schelter (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13164348#comment-13164348 ] 

Sebastian Schelter commented on MAHOUT-910:
-------------------------------------------

Is it possible to get the same behavior as in MAHOUT-914 with the latest version?
                

[jira] [Updated] (MAHOUT-910) Improve sampling in SamplingCandidateItemStrategy, optimize intersection computations

Posted by "Sean Owen (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-910:
-----------------------------

    Status: Patch Available  (was: Open)
    

[jira] [Commented] (MAHOUT-910) Improve sampling in SamplingCandidateItemStrategy, optimize intersection computations

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13164485#comment-13164485 ] 

Hudson commented on MAHOUT-910:
-------------------------------

Integrated in Mahout-Quality #1234 (See [https://builds.apache.org/job/Mahout-Quality/1234/])
    MAHOUT-910 merge ideas from MAHOUT-914, better docs, new no-limit arg, different defaults from Sebastian

srowen : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1211439
Files : 
* /mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/impl/recommender/SamplingCandidateItemsStrategy.java

                
> Improve sampling in SamplingCandidateItemStrategy, optimize intersection computations
> -------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-910
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-910
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Collaborative Filtering
>    Affects Versions: 0.5
>            Reporter: Sean Owen
>            Assignee: Sean Owen
>             Fix For: 0.6
>
>         Attachments: MAHOUT-910.patch, MAHOUT-910.patch, MAHOUT-910.patch, SamplingCandidateItemsStrategy.java
>
>
> Per the lengthy discussion on the mailing list about optimizing SamplingCandidateItemStrategy and related code, I'm opening this placeholder issue.


        

[jira] [Commented] (MAHOUT-910) Improve sampling in SamplingCandidateItemStrategy, optimize intersection computations

Posted by "Sean Owen (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13164352#comment-13164352 ] 

Sean Owen commented on MAHOUT-910:
----------------------------------

Yes, you would just set a very high value for the first and third limits, and set your desired limit for the second one. I think it's a superset of the original implementation and yours.

It occurs to me that we need a reliable way to specify "no limit". You're also using log base 2 instead of base e, which may make more sense. Why don't I bake those two ideas in? Then I think the two approaches come to the same thing.
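
For illustration only, here is a rough sketch of how a "no limit" setting and a base-2 logarithm could be combined when turning a tuning factor into a sample cap. The class name, the NO_LIMIT sentinel, the maxFrom helper and the formula factor * (1 + log2(n)) are all assumptions made for this example; the thread does not spell out the exact formula, and this is not necessarily what the attached patch does.

```java
public class SamplingLimitSketch {

  // Hypothetical sentinel meaning "never sample, always use the full set".
  static final int NO_LIMIT = Integer.MAX_VALUE;

  private static final double LOG2 = Math.log(2.0);

  // Assumed rule: the cap grows with the base-2 log of the collection size.
  static int maxFrom(int factor, int numThings) {
    if (factor == NO_LIMIT) {
      return NO_LIMIT;
    }
    int n = Math.max(1, numThings); // guard against log(0)
    double max = factor * (1.0 + Math.log(n) / LOG2);
    // Clamp so that a very large factor cannot overflow an int.
    return max >= NO_LIMIT ? NO_LIMIT : (int) max;
  }
}
```

With a rule like this, passing NO_LIMIT for the first and third settings and a real factor for the second would leave only the second stage sampled.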
                


        

[jira] [Commented] (MAHOUT-910) Improve sampling in SamplingCandidateItemStrategy, optimize intersection computations

Posted by "Sean Owen (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13164250#comment-13164250 ] 

Sean Owen commented on MAHOUT-910:
----------------------------------

Isn't this just a matter of setting the limits as you like? To be clear, any set smaller than a given limit is not sampled.
                


        

[jira] [Updated] (MAHOUT-910) Improve sampling in SamplingCandidateItemStrategy, optimize intersection computations

Posted by "Sean Owen (Updated) (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-910?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-910:
-----------------------------

    Attachment: MAHOUT-910.patch

The patch now samples all three things: the user's items, those items' users, and those users' items.
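
As a rough illustration of the three samplings (not the patch itself), the sketch below walks from a user's items to those items' users and then to those users' items, capping each step independently. It uses plain in-memory maps rather than a DataModel; the class, method and parameter names are hypothetical, and removing the user's own items from the result is an assumption of this example.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.Random;
import java.util.Set;

public class ThreeStageSamplingSketch {

  // A set no larger than the limit is returned whole; otherwise a random subset is taken.
  static List<Long> sample(Set<Long> ids, int limit, Random random) {
    List<Long> list = new ArrayList<>(ids);
    if (list.size() <= limit) {
      return list;
    }
    Collections.shuffle(list, random);
    return new ArrayList<>(list.subList(0, limit));
  }

  static Set<Long> candidateItems(long userID,
                                  Map<Long, Set<Long>> itemsByUser,
                                  Map<Long, Set<Long>> usersByItem,
                                  int maxItems,
                                  int maxUsersPerItem,
                                  int maxItemsPerUser,
                                  Random random) {
    Set<Long> preferred = itemsByUser.getOrDefault(userID, Collections.emptySet());
    Set<Long> candidates = new HashSet<>();
    // Sampling 1: the user's items.
    for (long itemID : sample(preferred, maxItems, random)) {
      Set<Long> users = usersByItem.getOrDefault(itemID, Collections.emptySet());
      // Sampling 2: those items' users.
      for (long otherUserID : sample(users, maxUsersPerItem, random)) {
        Set<Long> items = itemsByUser.getOrDefault(otherUserID, Collections.emptySet());
        // Sampling 3: those users' items.
        candidates.addAll(sample(items, maxItemsPerUser, random));
      }
    }
    // Assumption: items the user already has are not candidates.
    candidates.removeAll(preferred);
    return candidates;
  }
}
```

Passing Integer.MAX_VALUE for any of the three limits effectively disables sampling at that step.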
                
> Improve sampling in SamplingCandidateItemStrategy, optimize intersection computations
> -------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-910
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-910
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Collaborative Filtering
>    Affects Versions: 0.5
>            Reporter: Sean Owen
>            Assignee: Sean Owen
>             Fix For: 0.6
>
>         Attachments: MAHOUT-910.patch, MAHOUT-910.patch
>
>
> Per the lengthy discussion on the mailing list about optimizing SamplingCandidateItemStrategy and related code, I'm opening this placeholder issue.


        

[jira] [Commented] (MAHOUT-910) Improve sampling in SamplingCandidateItemStrategy, optimize intersection computations

Posted by "Hudson (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13161752#comment-13161752 ] 

Hudson commented on MAHOUT-910:
-------------------------------

Integrated in Mahout-Quality #1217 (See [https://builds.apache.org/job/Mahout-Quality/1217/])
    MAHOUT-910 prelude: commit some clear wins in optimizing calls to intersectionSize()

srowen : http://svn.apache.org/viewcvs.cgi/?root=Apache-SVN&view=rev&rev=1209577
Files : 
* /mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/impl/model/GenericBooleanPrefDataModel.java
* /mahout/trunk/core/src/main/java/org/apache/mahout/cf/taste/impl/similarity/TanimotoCoefficientSimilarity.java
* /mahout/trunk/integration/src/main/java/org/apache/mahout/cf/taste/impl/model/cassandra/CassandraDataModel.java

                
> Improve sampling in SamplingCandidateItemStrategy, optimize intersection computations
> -------------------------------------------------------------------------------------
>
>                 Key: MAHOUT-910
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-910
>             Project: Mahout
>          Issue Type: Improvement
>          Components: Collaborative Filtering
>    Affects Versions: 0.5
>            Reporter: Sean Owen
>            Assignee: Sean Owen
>             Fix For: 0.6
>
>
> Per the lengthy discussion on the mailing list about optimizing SamplingCandidateItemStrategy and related code, I'm opening this placeholder issue.


        

[jira] [Commented] (MAHOUT-910) Improve sampling in SamplingCandidateItemStrategy, optimize intersection computations

Posted by "Sean Owen (Commented) (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-910?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=13163743#comment-13163743 ] 

Sean Owen commented on MAHOUT-910:
----------------------------------

If I've understood Ted correctly, I'm not hearing any objection to taking this approach. We can modify it further later. Does anyone mind if I commit this?
                
