Posted to dev@mahout.apache.org by "Ted Dunning (JIRA)" <ji...@apache.org> on 2009/12/07 06:24:18 UTC

[jira] Created: (MAHOUT-212) Need random sampler for use in reducers

Need random sampler for use in reducers
---------------------------------------

                 Key: MAHOUT-212
                 URL: https://issues.apache.org/jira/browse/MAHOUT-212
             Project: Mahout
          Issue Type: Bug
          Components: Utils
    Affects Versions: 0.2
            Reporter: Ted Dunning
             Fix For: 0.3



For a variety of mining algorithms, it helps to have a uniform way to only process a sub-set of the records in a reducer.

As such, I have written a simple generic sampler that filters an Iterator returning a fair sample of at most a specified size.
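
One common way to draw a fair sample of at most a fixed size from an Iterator of unknown length is reservoir sampling. The sketch below assumes that technique purely for illustration; the attached patch may work differently, and the names ReservoirSketch and sample are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

// Hypothetical illustration; not the class from MAHOUT-212.patch.
final class ReservoirSketch {

  // Returns a uniform random sample of at most `size` elements from `input`.
  static <T> List<T> sample(Iterator<T> input, int size, Random random) {
    List<T> reservoir = new ArrayList<T>(size);
    int seen = 0;
    while (input.hasNext()) {
      T value = input.next();
      seen++;
      if (reservoir.size() < size) {
        // Fill phase: keep the first `size` elements unconditionally.
        reservoir.add(value);
      } else {
        // Replace phase: the new element survives with probability size / seen.
        int slot = random.nextInt(seen);
        if (slot < size) {
          reservoir.set(slot, value);
        }
      }
    }
    return reservoir;
  }

  public static void main(String[] args) {
    List<Integer> data = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
    // Three elements chosen at random from ten.
    System.out.println(sample(data.iterator(), 3, new Random(42)));
    // Asking for more than the input holds returns the whole input.
    System.out.println(sample(data.iterator(), 20, new Random(42)));
  }
}
```

The caller never needs to know the total number of records, which is the situation inside a reducer. The whole input is still read once.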

-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.


[jira] Commented: (MAHOUT-212) Need random sampler for use in reducers

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786841#action_12786841 ] 

Sean Owen commented on MAHOUT-212:
----------------------------------

Sure, understood. How about putting them in the same place, and making them be named / look / act similarly? That brings me to a number of comments on the patch:

- I had suggested we not use both org.apache.mahout.common and org.apache.mahout.utils as the "common stuff" package, since that's redundant. We sort of standardized on common, but, retained utils for various reasons. I think this belongs in core/ and under .common

- The Random shouldn't be an instance variable, right? It should be obtained from RandomUtils instead.

- It's not necessary to keep the original Iterator since as you show, you really must sample it all upfront as you do. In this sense it's almost not properly a class that should produce an Iterator, but a List, but, I like the tidiness of an Iterator wrapper.

- Consider providing an Iterable counterpart for easy use with foreach loops, like I did with SamplingIterable

- Name it something ending with Iterator since it's an Iterator? FixedSizeSampleIterator?

- Are methods like copyInput() necessarily public, and is there a need to set the generator?

- Very picky, usually see test cases end in TestCase


If you agree with these but don't care to implement them, I can do so.

> Need random sampler for use in reducers
> ---------------------------------------
>
>                 Key: MAHOUT-212
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-212
>             Project: Mahout
>          Issue Type: Bug
>          Components: Utils
>    Affects Versions: 0.2
>            Reporter: Ted Dunning
>            Assignee: Sean Owen
>             Fix For: 0.3
>
>         Attachments: MAHOUT-212.patch
>
>
> For a variety of mining algorithms, it helps to have a uniform way to only process a sub-set of the records in a reducer.
> As such, I have written a simple generic sampler that filters an Iterator returning a fair sample of at most a specified size.



[jira] Commented: (MAHOUT-212) Need random sampler for use in reducers

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787020#action_12787020 ] 

Ted Dunning commented on MAHOUT-212:
------------------------------------

bq.  I had suggested we not use both org.apache.mahout.common and org.apache.mahout.utils as the "common stuff" package, since that's redundant. We sort of standardized on common, but, retained utils for various reasons. I think this belongs in core/ and under .common

Sounds good.

bq. The Random shouldn't be instance variable right, and should be obtained from RandomUtils?

I like having it be injectable for testing purposes.  As long as it exhibits the same interface as j.u.Random, we should be fine.  There may be a better interface from RandomUtils.  Feel free to suggest one, but I really do want to keep the injectability of the generator.

bq. It's not necessary to keep the original Iterator since as you show, you really must sample it all upfront as you do. In this sense it's almost not properly a class that should produce an Iterator, but a List, but, I like the tidiness of an Iterator wrapper.

This is a point I waffled on.  The real question here is whether we care about the corner case where we don't read anything from the iterator.  I went slightly nuts and decided I did care to optimize that point, but you make a strong counter argument that the class could be simpler if copyInput were called from the constructor.  That would simplify testing as well.

bq. Consider providing an Iterable counterpart for easy use with foreach loops, like I did with SamplingIterable

Quite doable.

bq. Name it something ending with Iterator since it's an Iterator? FixedSizeSampleIterator?

Also a fine idea.

bq. Are methods like copyInput() necessarily public, and is there a need to set the generator?

They could be package level.  I merely exposed it to be able to do more detailed testing.  This adds weight to your argument about keeping the original iterator.

bq. Very picky, usually see test cases end in TestCase

I usually see test cases that start or end with Test.  It is an old convention from many ant builds that required regexes.  I don't much care except that I would have a small preference for making abstract tests end in TestCase in order to distinguish them from concrete tests.

bq. If you agree with these but don't care to implement, I can do so.

Let me take one more crack.

> Need random sampler for use in reducers
> ---------------------------------------
>
>                 Key: MAHOUT-212
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-212
>             Project: Mahout
>          Issue Type: Bug
>          Components: Utils
>    Affects Versions: 0.2
>            Reporter: Ted Dunning
>            Assignee: Sean Owen
>             Fix For: 0.3
>
>         Attachments: MAHOUT-212.patch
>
>
> For a variety of mining algorithms, it helps to have a uniform way to only process a sub-set of the records in a reducer.
> As such, I have written a simple generic sampler that filters an Iterator returning a fair sample of at most a specified size.



[jira] Updated: (MAHOUT-212) Need random sampler for use in reducers

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Dunning updated MAHOUT-212:
-------------------------------

    Status: Patch Available  (was: Open)


Code plus test cases.

Ready for use.  I think.

> Need random sampler for use in reducers
> ---------------------------------------
>
>                 Key: MAHOUT-212
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-212
>             Project: Mahout
>          Issue Type: Bug
>          Components: Utils
>    Affects Versions: 0.2
>            Reporter: Ted Dunning
>            Assignee: Ted Dunning
>             Fix For: 0.3
>
>
> For a variety of mining algorithms, it helps to have a uniform way to only process a sub-set of the records in a reducer.
> As such, I have written a simple generic sampler that filters an Iterator returning a fair sample of at most a specified size.



[jira] Updated: (MAHOUT-212) Need random sampler for use in reducers

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Dunning updated MAHOUT-212:
-------------------------------

    Attachment: MAHOUT-212.patch

Hmm... didn't get asked for where the patch file was when marking the bug as patch available. 

> Need random sampler for use in reducers
> ---------------------------------------
>
>                 Key: MAHOUT-212
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-212
>             Project: Mahout
>          Issue Type: Bug
>          Components: Utils
>    Affects Versions: 0.2
>            Reporter: Ted Dunning
>            Assignee: Ted Dunning
>             Fix For: 0.3
>
>         Attachments: MAHOUT-212.patch
>
>
> For a variety of mining algorithms, it helps to have a uniform way to only process a sub-set of the records in a reducer.
> As such, I have written a simple generic sampler that filters an Iterator returning a fair sample of at most a specified size.



[jira] Updated: (MAHOUT-212) Need random sampler for use in reducers

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-212:
-----------------------------

    Attachment: MAHOUT-212-C.patch

Ted, I've gone big on this and collapsed a lot more iterator-related things into this package, and tried to unify them a little. It'll probably break any dependencies you have on it, but not badly. What say you?

> Need random sampler for use in reducers
> ---------------------------------------
>
>                 Key: MAHOUT-212
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-212
>             Project: Mahout
>          Issue Type: Bug
>          Components: Utils
>    Affects Versions: 0.2
>            Reporter: Ted Dunning
>            Assignee: Sean Owen
>             Fix For: 0.3
>
>         Attachments: MAHOUT-212-b.patch, MAHOUT-212-C.patch, MAHOUT-212.patch
>
>
> For a variety of mining algorithms, it helps to have a uniform way to only process a sub-set of the records in a reducer.
> As such, I have written a simple generic sampler that filters an Iterator returning a fair sample of at most a specified size.



[jira] Commented: (MAHOUT-212) Need random sampler for use in reducers

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787157#action_12787157 ] 

Sean Owen commented on MAHOUT-212:
----------------------------------

What's the idea behind the Entry class? I am not seeing how that originalIndex is used.

Random generator = RandomUtils.getRandom();
This can be a private static final member.

The existing SamplingIterator could use the DelegatingIterator class too if you like.


I take the general point about injection, but at some level a component isn't meaningfully injectable. What kind of fault would you inject in a Random? You could test what happens when it returns a value outside the interval, or maybe throws an Error, but is that reasonable? What flexibility would it meaningfully provide, at least in comparison to the extra method and the new possible error scenarios?

Statics have their place and this instance seems like a fine example to me. It's one of the few cases where I really do need to make sure every instance in the whole program works a certain way.

> Need random sampler for use in reducers
> ---------------------------------------
>
>                 Key: MAHOUT-212
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-212
>             Project: Mahout
>          Issue Type: Bug
>          Components: Utils
>    Affects Versions: 0.2
>            Reporter: Ted Dunning
>            Assignee: Sean Owen
>             Fix For: 0.3
>
>         Attachments: MAHOUT-212-b.patch, MAHOUT-212.patch
>
>
> For a variety of mining algorithms, it helps to have a uniform way to only process a sub-set of the records in a reducer.
> As such, I have written a simple generic sampler that filters an Iterator returning a fair sample of at most a specified size.



[jira] Updated: (MAHOUT-212) Need random sampler for use in reducers

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Dunning updated MAHOUT-212:
-------------------------------

    Attachment: MAHOUT-212-b.patch

Here is the actual patch.

> Need random sampler for use in reducers
> ---------------------------------------
>
>                 Key: MAHOUT-212
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-212
>             Project: Mahout
>          Issue Type: Bug
>          Components: Utils
>    Affects Versions: 0.2
>            Reporter: Ted Dunning
>            Assignee: Sean Owen
>             Fix For: 0.3
>
>         Attachments: MAHOUT-212-b.patch, MAHOUT-212.patch
>
>
> For a variety of mining algorithms, it helps to have a uniform way to only process a sub-set of the records in a reducer.
> As such, I have written a simple generic sampler that filters an Iterator returning a fair sample of at most a specified size.



[jira] Commented: (MAHOUT-212) Need random sampler for use in reducers

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788262#action_12788262 ] 

Ted Dunning commented on MAHOUT-212:
------------------------------------


I am snowed under and won't get to this for at least a week.  Feel free to commit.


> Need random sampler for use in reducers
> ---------------------------------------
>
>                 Key: MAHOUT-212
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-212
>             Project: Mahout
>          Issue Type: Bug
>          Components: Utils
>    Affects Versions: 0.2
>            Reporter: Ted Dunning
>            Assignee: Sean Owen
>             Fix For: 0.3
>
>         Attachments: MAHOUT-212-b.patch, MAHOUT-212.patch
>
>
> For a variety of mining algorithms, it helps to have a uniform way to only process a sub-set of the records in a reducer.
> As such, I have written a simple generic sampler that filters an Iterator returning a fair sample of at most a specified size.



[jira] Updated: (MAHOUT-212) Need random sampler for use in reducers

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen updated MAHOUT-212:
-----------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

> Need random sampler for use in reducers
> ---------------------------------------
>
>                 Key: MAHOUT-212
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-212
>             Project: Mahout
>          Issue Type: Bug
>          Components: Utils
>    Affects Versions: 0.2
>            Reporter: Ted Dunning
>            Assignee: Sean Owen
>             Fix For: 0.3
>
>         Attachments: MAHOUT-212-b.patch, MAHOUT-212-C.patch, MAHOUT-212.patch
>
>
> For a variety of mining algorithms, it helps to have a uniform way to only process a sub-set of the records in a reducer.
> As such, I have written a simple generic sampler that filters an Iterator returning a fair sample of at most a specified size.



[jira] Assigned: (MAHOUT-212) Need random sampler for use in reducers

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Dunning reassigned MAHOUT-212:
----------------------------------

    Assignee: Sean Owen  (was: Ted Dunning)

> Need random sampler for use in reducers
> ---------------------------------------
>
>                 Key: MAHOUT-212
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-212
>             Project: Mahout
>          Issue Type: Bug
>          Components: Utils
>    Affects Versions: 0.2
>            Reporter: Ted Dunning
>            Assignee: Sean Owen
>             Fix For: 0.3
>
>         Attachments: MAHOUT-212.patch
>
>
> For a variety of mining algorithms, it helps to have a uniform way to only process a sub-set of the records in a reducer.
> As such, I have written a simple generic sampler that filters an Iterator returning a fair sample of at most a specified size.



[jira] Commented: (MAHOUT-212) Need random sampler for use in reducers

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786820#action_12786820 ] 

Ted Dunning commented on MAHOUT-212:
------------------------------------


Kinda existed, but SamplingIterator takes a sample rate which can't be known if you don't know the total size.  The FixedSizeSampler just takes a desired size and gives you exactly that many.

Merging them makes sense from the point of view of a user, but there is little in common between the implementations.
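
To make the contrast concrete, here is a minimal sketch of rate-based sampling, assuming each record is kept independently with probability equal to the rate. This is an assumption for illustration and not necessarily how the existing SamplingIterator is written; the names RateSamplingSketch and sampleAtRate are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.Random;

// Hypothetical illustration of rate-based sampling.
final class RateSamplingSketch {

  // Keeps each element independently with probability `rate`.
  static <T> List<T> sampleAtRate(Iterator<T> input, double rate, Random random) {
    List<T> kept = new ArrayList<T>();
    while (input.hasNext()) {
      T value = input.next();
      // nextDouble() is in [0, 1), so rate 1.0 keeps everything and 0.0 keeps nothing.
      if (random.nextDouble() < rate) {
        kept.add(value);
      }
    }
    return kept;
  }

  public static void main(String[] args) {
    List<Integer> data = Arrays.asList(1, 2, 3, 4, 5, 6, 7, 8, 9, 10);
    // The number of survivors varies from run to run and scales with the input size.
    System.out.println(sampleAtRate(data.iterator(), 0.3, new Random()));
  }
}
```

With this scheme the output size is random and proportional to the input size, so hitting a target of k records means choosing a rate near k / n, which requires knowing n in advance.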


> Need random sampler for use in reducers
> ---------------------------------------
>
>                 Key: MAHOUT-212
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-212
>             Project: Mahout
>          Issue Type: Bug
>          Components: Utils
>    Affects Versions: 0.2
>            Reporter: Ted Dunning
>            Assignee: Sean Owen
>             Fix For: 0.3
>
>         Attachments: MAHOUT-212.patch
>
>
> For a variety of mining algorithms, it helps to have a uniform way to only process a sub-set of the records in a reducer.
> As such, I have written a simple generic sampler that filters an Iterator returning a fair sample of at most a specified size.



[jira] Commented: (MAHOUT-212) Need random sampler for use in reducers

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787033#action_12787033 ] 

Sean Owen commented on MAHOUT-212:
----------------------------------

Yeah, test injection was the idea behind using RandomUtils, since it will return a generator that uses the same seed every time when set in test mode. The unit tests do (or should) set it globally as such, to make sure the results are deterministic. Yes, the returned generator is a MersenneTwisterRNG, which just extends Random.
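
The idea can be sketched roughly as follows. This is a hypothetical stand-in written against java.util.Random only; it is not Mahout's RandomUtils, and the names RandomSourceSketch, enableTestMode and TEST_SEED are made up for the example.

```java
import java.util.Random;

// Hypothetical stand-in for a global random source with a test mode.
final class RandomSourceSketch {

  // Arbitrary fixed seed used only in test mode (assumption for this sketch).
  private static final long TEST_SEED = 12345L;

  private static volatile boolean testMode = false;

  private RandomSourceSketch() {
  }

  // Called once by test setup so that every generator handed out is repeatable.
  static void enableTestMode() {
    testMode = true;
  }

  // Returns a fixed-seed generator in test mode, otherwise a normally seeded one.
  static Random getRandom() {
    return testMode ? new Random(TEST_SEED) : new Random();
  }

  public static void main(String[] args) {
    enableTestMode();
    // Both generators start from the same seed, so this prints "true".
    System.out.println(getRandom().nextInt() == getRandom().nextInt());
  }
}
```

This gives deterministic sequences to every caller in the program, but it only ever supplies a standard sequence; it cannot hand a specific caller a generator that returns hand-picked values.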

Yes anything for testing should probably be package-private.

(I'd also suggest making the instance fields private here? not sure there's a big case for extension, at least, one that isn't perhaps better answered with explicit getters)

I don't care about the test naming convention.

Once this is in place I'll put my similar Iterator next to it.

> Need random sampler for use in reducers
> ---------------------------------------
>
>                 Key: MAHOUT-212
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-212
>             Project: Mahout
>          Issue Type: Bug
>          Components: Utils
>    Affects Versions: 0.2
>            Reporter: Ted Dunning
>            Assignee: Sean Owen
>             Fix For: 0.3
>
>         Attachments: MAHOUT-212.patch
>
>
> For a variety of mining algorithms, it helps to have a uniform way to only process a sub-set of the records in a reducer.
> As such, I have written a simple generic sampler that filters an Iterator returning a fair sample of at most a specified size.



[jira] Commented: (MAHOUT-212) Need random sampler for use in reducers

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788056#action_12788056 ] 

Sean Owen commented on MAHOUT-212:
----------------------------------

In case you're waiting on comment from me, as far as I am concerned you can submit.

> Need random sampler for use in reducers
> ---------------------------------------
>
>                 Key: MAHOUT-212
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-212
>             Project: Mahout
>          Issue Type: Bug
>          Components: Utils
>    Affects Versions: 0.2
>            Reporter: Ted Dunning
>            Assignee: Sean Owen
>             Fix For: 0.3
>
>         Attachments: MAHOUT-212-b.patch, MAHOUT-212.patch
>
>
> For a variety of mining algorithms, it helps to have a uniform way to only process a sub-set of the records in a reducer.
> As such, I have written a simple generic sampler that filters an Iterator returning a fair sample of at most a specified size.



[jira] Commented: (MAHOUT-212) Need random sampler for use in reducers

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787169#action_12787169 ] 

Ted Dunning commented on MAHOUT-212:
------------------------------------

bq. What's the idea behind the Entry class? I am not seeing how that originalIndex is used.

It allows sorting of the results according to the original ordering of the data elements.  The originalIndex is used in the comparison function.  Since we have to keep the original index to do this sort, I needed a handy way to glue the index to the value.  My preference would have been a side array and a sort function that returns a permutation.  That would have allowed my two classes to merge completely with very little overhead.  Sorts that return permutations are commonly found in R or Matlab, but (to my knowledge) not in Java.
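
A minimal sketch of that arrangement, assuming an Entry that simply pairs the index with the value. The field names follow the discussion, but the class layout and the names OriginalOrderSketch and inOriginalOrder are hypothetical rather than taken from the patch.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.Comparator;
import java.util.List;

// Hypothetical illustration of restoring input order after sampling.
final class OriginalOrderSketch {

  // Glues a sampled value to the position it had in the input.
  static final class Entry<T> {
    final int originalIndex;
    final T value;

    Entry(int originalIndex, T value) {
      this.originalIndex = originalIndex;
      this.value = value;
    }
  }

  // Returns the sampled values sorted by their original position.
  static <T> List<T> inOriginalOrder(List<Entry<T>> sample) {
    List<Entry<T>> sorted = new ArrayList<Entry<T>>(sample);
    Collections.sort(sorted, new Comparator<Entry<T>>() {
      @Override
      public int compare(Entry<T> a, Entry<T> b) {
        return Integer.compare(a.originalIndex, b.originalIndex);
      }
    });
    List<T> values = new ArrayList<T>(sorted.size());
    for (Entry<T> entry : sorted) {
      values.add(entry.value);
    }
    return values;
  }

  public static void main(String[] args) {
    List<Entry<String>> sample = new ArrayList<Entry<String>>();
    sample.add(new Entry<String>(5, "e"));
    sample.add(new Entry<String>(1, "a"));
    sample.add(new Entry<String>(3, "c"));
    System.out.println(inOriginalOrder(sample)); // prints [a, c, e]
  }
}
```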

bq. Random generator = RandomUtils.getRandom();
This can be a private static final member.

It should actually be a local after making the change you suggested.  Sorry I missed that.

bq. The existing SamplingIterator could use the DelegatingIterator class too if you like.

After you, Gaston!

(seriously, my tiny window of time for coding just closed.  Feel free to take this and do anything you like) 

bq. I take the general point about injection but at some level a component isn't meaningfully injectable. What kind of fault would you inject in a Random? 

I have had a few cases where very rare, but legal, values from a generator would cause a fault.  It was easiest to provide a mock generator to stimulate these corners.  

As a concrete example, an exponential distribution can be generated using -log(u) where u is a random variable from (0, 1].  But most random number generators generate doubles from [0, 1), so the sampling should really be done using -log(1-u).  It is reasonable to inject a generator that returns a 0 to make sure that the edge condition is handled well.
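
That edge case can be reproduced with a mock generator along these lines. This is only a sketch; ExponentialSketch, naive, safe and ZeroRandom are hypothetical names, and the mock overrides just the one method the samplers call.

```java
import java.util.Random;

// Hypothetical illustration of injecting a rare but legal generator value.
final class ExponentialSketch {

  // Naive form: if nextDouble() returns 0.0, -log(0) is positive infinity.
  static double naive(Random random) {
    return -Math.log(random.nextDouble());
  }

  // Safer form: 1 - u lies in (0, 1] when u is in [0, 1), so the result is finite.
  static double safe(Random random) {
    return -Math.log(1.0 - random.nextDouble());
  }

  // Mock generator that always returns the boundary value 0.0.
  static final class ZeroRandom extends Random {
    @Override
    public double nextDouble() {
      return 0.0;
    }
  }

  public static void main(String[] args) {
    System.out.println(Double.isInfinite(naive(new ZeroRandom()))); // prints true
    System.out.println(Double.isInfinite(safe(new ZeroRandom())));  // prints false
  }
}
```

A seeded generator would almost never produce exactly 0.0, which is why a hand-built mock is the easier way to reach this corner.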

What I would suggest is that if this is or becomes important, we can add a method to truly inject a generator.  It isn't important here so I left it out.



> Need random sampler for use in reducers
> ---------------------------------------
>
>                 Key: MAHOUT-212
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-212
>             Project: Mahout
>          Issue Type: Bug
>          Components: Utils
>    Affects Versions: 0.2
>            Reporter: Ted Dunning
>            Assignee: Sean Owen
>             Fix For: 0.3
>
>         Attachments: MAHOUT-212-b.patch, MAHOUT-212.patch
>
>
> For a variety of mining algorithms, it helps to have a uniform way to only process a sub-set of the records in a reducer.
> As such, I have written a simple generic sampler that filters an Iterator returning a fair sample of at most a specified size.



[jira] Assigned: (MAHOUT-212) Need random sampler for use in reducers

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
     [ https://issues.apache.org/jira/browse/MAHOUT-212?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Ted Dunning reassigned MAHOUT-212:
----------------------------------

    Assignee: Ted Dunning

> Need random sampler for use in reducers
> ---------------------------------------
>
>                 Key: MAHOUT-212
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-212
>             Project: Mahout
>          Issue Type: Bug
>          Components: Utils
>    Affects Versions: 0.2
>            Reporter: Ted Dunning
>            Assignee: Ted Dunning
>             Fix For: 0.3
>
>
> For a variety of mining algorithms, it helps to have a uniform way to only process a sub-set of the records in a reducer.
> As such, I have written a simple generic sampler that filters an Iterator returning a fair sample of at most a specified size.



[jira] Commented: (MAHOUT-212) Need random sampler for use in reducers

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12786806#action_12786806 ] 

Sean Owen commented on MAHOUT-212:
----------------------------------

This kinda already existed as SamplingIterator -- does that do the same thing? Could these be merged then, pulling the class into a common location and combining aspects of both?

> Need random sampler for use in reducers
> ---------------------------------------
>
>                 Key: MAHOUT-212
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-212
>             Project: Mahout
>          Issue Type: Bug
>          Components: Utils
>    Affects Versions: 0.2
>            Reporter: Ted Dunning
>            Assignee: Sean Owen
>             Fix For: 0.3
>
>         Attachments: MAHOUT-212.patch
>
>
> For a variety of mining algorithms, it helps to have a uniform way to only process a sub-set of the records in a reducer.
> As such, I have written a simple generic sampler that filters an Iterator returning a fair sample of at most a specified size.



[jira] Commented: (MAHOUT-212) Need random sampler for use in reducers

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12788839#action_12788839 ] 

Ted Dunning commented on MAHOUT-212:
------------------------------------


Awesome.  Thanks.

I have no dependencies.  I simply wrote this to help you and anybody else with the cooccurrence stuff (you mentioned delaying the sampling until later).
 

> Need random sampler for use in reducers
> ---------------------------------------
>
>                 Key: MAHOUT-212
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-212
>             Project: Mahout
>          Issue Type: Bug
>          Components: Utils
>    Affects Versions: 0.2
>            Reporter: Ted Dunning
>            Assignee: Sean Owen
>             Fix For: 0.3
>
>         Attachments: MAHOUT-212-b.patch, MAHOUT-212-C.patch, MAHOUT-212.patch
>
>
> For a variety of mining algorithms, it helps to have a uniform way to only process a sub-set of the records in a reducer.
> As such, I have written a simple generic sampler that filters an Iterator returning a fair sample of at most a specified size.



[jira] Commented: (MAHOUT-212) Need random sampler for use in reducers

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.
    [ https://issues.apache.org/jira/browse/MAHOUT-212?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12787143#action_12787143 ] 

Ted Dunning commented on MAHOUT-212:
------------------------------------


Here is another patch.  I moved SamplingIterator and Iterable.  I moved to eager sampling and simplified code and test cases as a result.  In the end, there is no additional functionality from an Iterable (it just wraps the Iterator in a way that it returns that same iterator) so I left it out.  
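
For reference, an Iterable that simply hands back one existing Iterator, as described above, can be as small as the sketch below. The name OneShotIterable is hypothetical; this is not the SamplingIterable class and is not part of the patch.

```java
import java.util.Arrays;
import java.util.Iterator;

// Hypothetical wrapper that lets a single Iterator drive a foreach loop.
final class OneShotIterable<T> implements Iterable<T> {

  private final Iterator<T> delegate;

  OneShotIterable(Iterator<T> delegate) {
    this.delegate = delegate;
  }

  // Returns the same iterator every time, so only one pass is possible.
  @Override
  public Iterator<T> iterator() {
    return delegate;
  }

  public static void main(String[] args) {
    Iterator<Integer> sampled = Arrays.asList(1, 2, 3).iterator();
    int sum = 0;
    for (int value : new OneShotIterable<Integer>(sampled)) {
      sum += value;
    }
    System.out.println(sum); // prints 6
  }
}
```

The wrapper buys foreach syntax and nothing else, and a second loop over the same instance sees an exhausted iterator.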

bq. Sean>   The Random shouldn't be instance variable right, and should be obtained from RandomUtils?

bq. ted> I like having it be injectable for testing purposes. As long as it exhibits the same interface as j.u.Random, we should be fine. There may be a better interface from RandomUtils. Feel free to suggest one, but I really do want to keep the injectability of the generator.

I switched to use RandomUtils, but I still think it is a poor form of injection.  Most notably, we can't inject faults using this approach, only a standard test sequence.  Besides, I hate global variables ( which is what a static class like this is ).  For these tests, those issues don't matter so I switched.





> Need random sampler for use in reducers
> ---------------------------------------
>
>                 Key: MAHOUT-212
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-212
>             Project: Mahout
>          Issue Type: Bug
>          Components: Utils
>    Affects Versions: 0.2
>            Reporter: Ted Dunning
>            Assignee: Sean Owen
>             Fix For: 0.3
>
>         Attachments: MAHOUT-212.patch
>
>
> For a variety of mining algorithms, it helps to have a uniform way to only process a sub-set of the records in a reducer.
> As such, I have written a simple generic sampler that filters an Iterator returning a fair sample of at most a specified size.
