Posted to dev@mahout.apache.org by Raphael Cendrillon <ce...@gmail.com> on 2011/12/16 03:01:25 UTC

Re: Review Request: Support for Randomizing Input in SplitInput Class

-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/3092/
-----------------------------------------------------------

(Updated 2011-12-16 02:01:25.825802)


Review request for mahout, Ted Dunning, lancenorskog, and Grant Ingersoll.


Summary
-------

Early support for randomizing input in the SplitInput class. This is only a first pass; I've posted it to check whether I'm on the right track. A couple of comments:

  - Currently the code scans the entire file to find the line corresponding to each random index. This is repeated for every output line, so the cost grows quadratically with the number of lines, which is slow and somewhat ugly.
  - The permutation indices are stored in an array. This could lead to scaling issues if the number of input lines is large. The same problem may exist with ridx in the existing code. One option is to use a linear feedback shift register to generate the permutation sequence on the fly.
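
To make the linear feedback shift register idea concrete, here is a minimal sketch, not the code in the patch. It assumes a 16-bit Galois LFSR with the well-known maximal-length tap mask 0xB400, so it only handles up to 65535 lines; the class name LfsrPermutation and its constructor arguments are hypothetical. The register cycles through every non-zero 16-bit state exactly once, and states larger than n are skipped, so each index in [0, n) comes out exactly once with constant memory.

```java
import java.util.Iterator;
import java.util.NoSuchElementException;

/** Sketch: visits every index in [0, n) exactly once without storing an index array. */
final class LfsrPermutation implements Iterator<Integer> {
  // Tap mask for x^16 + x^14 + x^13 + x^11 + 1 (maximal length: period 65535).
  private static final int TAPS = 0xB400;
  static final int MAX_N = 0xFFFF;

  private final int n;
  private int state;    // current non-zero register state
  private int emitted;  // how many indices have been returned so far

  /** n is the number of lines (1..65535); seed only picks the starting point. */
  LfsrPermutation(int n, int seed) {
    if (n < 1 || n > MAX_N) {
      throw new IllegalArgumentException("n must be in 1..65535 for a 16-bit register");
    }
    this.n = n;
    int start = seed & MAX_N;
    this.state = (start == 0) ? 1 : start; // the all-zero state is a fixed point
  }

  /** One Galois LFSR step: shift right, then apply the taps if a 1 was shifted out. */
  private static int step(int s) {
    int lsb = s & 1;
    s >>>= 1;
    return (lsb != 0) ? (s ^ TAPS) : s;
  }

  @Override
  public boolean hasNext() {
    return emitted < n;
  }

  @Override
  public Integer next() {
    if (!hasNext()) {
      throw new NoSuchElementException();
    }
    while (state > n) {   // skip states that do not map to a valid line
      state = step(state);
    }
    int index = state - 1; // states 1..n map to indices 0..n-1
    state = step(state);
    emitted++;
    return index;
  }
}
```

Note that the visiting order is one fixed cycle and the seed only chooses where in that cycle to start, so this trades shuffle quality for memory. Larger inputs would need a wider register with a different tap mask.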

Any suggestions would be very welcome!


This addresses bug MAHOUT-904.
    https://issues.apache.org/jira/browse/MAHOUT-904


Diffs
-----

  /trunk/integration/src/main/java/org/apache/mahout/utils/RandomPermuteJob.java PRE-CREATION 
  /trunk/integration/src/test/java/org/apache/mahout/utils/TestRandomPermuteJob.java PRE-CREATION 
  /trunk/integration/src/main/java/org/apache/mahout/utils/IntVectorWritable.java PRE-CREATION 

Diff: https://reviews.apache.org/r/3092/diff


Testing
-------


Thanks,

Raphael


Re: Review Request: Support for Randomizing Input in SplitInput Class

Posted by Raphael Cendrillon <ce...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/3092/
-----------------------------------------------------------

(Updated 2011-12-23 23:14:34.869723)


Review request for mahout, Ted Dunning, lancenorskog, and Grant Ingersoll.


Changes
-------

Replaced IntWritable with WritableComparable so that any key class can be used. Added instantiation of Configuration so that the tests pass when SplitInputJob is used from within code.


This addresses bug MAHOUT-904.
    https://issues.apache.org/jira/browse/MAHOUT-904


Diffs (updated)
-----

  /trunk/core/src/main/java/org/apache/mahout/common/AbstractJob.java 1221886 
  /trunk/examples/bin/asf-email-examples.sh 1221886 
  /trunk/integration/src/main/java/org/apache/mahout/utils/SplitInput.java 1221886 
  /trunk/integration/src/main/java/org/apache/mahout/utils/SplitInputJob.java PRE-CREATION 
  /trunk/integration/src/test/java/org/apache/mahout/utils/SplitInputTest.java 1221886 

Diff: https://reviews.apache.org/r/3092/diff


Testing
-------


Thanks,

Raphael


Re: Review Request: Support for Randomizing Input in SplitInput Class

Posted by Raphael Cendrillon <ce...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/3092/
-----------------------------------------------------------

(Updated 2011-12-22 16:03:47.932528)


Review request for mahout, Ted Dunning, lancenorskog, and Grant Ingersoll.


Changes
-------

Grant's changes


This addresses bug MAHOUT-904.
    https://issues.apache.org/jira/browse/MAHOUT-904


Diffs (updated)
-----

  /trunk/core/src/main/java/org/apache/mahout/common/AbstractJob.java 1222286 
  /trunk/integration/src/main/java/org/apache/mahout/utils/SplitInput.java 1222286 
  /trunk/integration/src/main/java/org/apache/mahout/utils/SplitInputJob.java PRE-CREATION 
  /trunk/integration/src/test/java/org/apache/mahout/utils/SplitInputTest.java 1222286 

Diff: https://reviews.apache.org/r/3092/diff


Testing
-------


Thanks,

Raphael


Re: Review Request: Support for Randomizing Input in SplitInput Class

Posted by Raphael Cendrillon <ce...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/3092/
-----------------------------------------------------------

(Updated 2011-12-22 15:45:05.856640)


Review request for mahout, Ted Dunning, lancenorskog, and Grant Ingersoll.


Changes
-------

Grant's changes


This addresses bug MAHOUT-904.
    https://issues.apache.org/jira/browse/MAHOUT-904


Diffs (updated)
-----

  /trunk/core/src/main/java/org/apache/mahout/common/AbstractJob.java 1222286 
  /trunk/integration/src/main/java/org/apache/mahout/utils/SplitInput.java 1222286 
  /trunk/integration/src/test/java/org/apache/mahout/utils/SplitInputTest.java 1222286 

Diff: https://reviews.apache.org/r/3092/diff


Testing
-------


Thanks,

Raphael


Re: Review Request: Support for Randomizing Input in SplitInput Class

Posted by Raphael Cendrillon <ce...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/3092/
-----------------------------------------------------------

(Updated 2011-12-22 04:38:08.261769)


Review request for mahout, Ted Dunning, lancenorskog, and Grant Ingersoll.


Changes
-------

Implemented downsampling more efficiently through the mapper's run() method, implemented random permutation through a sort comparator class, added a driver, and integrated everything into SplitInput.
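
The patch is only linked here, not quoted, so the following is a sketch of the general idea of permuting through a sort comparator rather than the code under review. It assumes the comparator orders keys by a seeded scramble of the key instead of by natural order; the class HashOrderShuffle and its methods are hypothetical, and plain Long keys stand in for Hadoop writables. Because the scramble is a bijection on 64-bit values, distinct keys never tie, so the comparator still defines a consistent total order, which a sort requires.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

/** Sketch: obtain a pseudo-random permutation by sorting on a scrambled key. */
final class HashOrderShuffle {
  private HashOrderShuffle() {}

  /** SplitMix64 finalizer: an invertible mixing function on 64-bit values. */
  static long mix(long z) {
    z = (z ^ (z >>> 30)) * 0xBF58476D1CE4E5B9L;
    z = (z ^ (z >>> 27)) * 0x94D049BB133111EBL;
    return z ^ (z >>> 31);
  }

  /** Orders keys by their scrambled value; different seeds give different orders. */
  static Comparator<Long> scrambled(long seed) {
    return Comparator.comparingLong((Long key) -> mix(key + seed));
  }

  /** Returns a copy of keys sorted in scrambled order; the input list is untouched. */
  static List<Long> shuffle(List<Long> keys, long seed) {
    List<Long> copy = new ArrayList<>(keys);
    copy.sort(scrambled(seed));
    return copy;
  }
}
```

In a MapReduce job the same ordering would be plugged in where the framework sorts map output keys, so the shuffle phase does the permuting and no index array is needed. A comparator that returns random results on each call would not work, since it breaks the ordering contract.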


This addresses bug MAHOUT-904.
    https://issues.apache.org/jira/browse/MAHOUT-904


Diffs (updated)
-----

  /trunk/integration/src/main/java/org/apache/mahout/utils/SplitInput.java 1215567 
  /trunk/integration/src/main/java/org/apache/mahout/utils/SplitInputJob.java PRE-CREATION 
  /trunk/integration/src/test/java/org/apache/mahout/utils/SplitInputTest.java 1215567 

Diff: https://reviews.apache.org/r/3092/diff


Testing
-------


Thanks,

Raphael


Re: Review Request: Support for Randomizing Input in SplitInput Class

Posted by Raphael Cendrillon <ce...@gmail.com>.
-----------------------------------------------------------
This is an automatically generated e-mail. To reply, visit:
https://reviews.apache.org/r/3092/
-----------------------------------------------------------

(Updated 2011-12-16 19:09:13.382909)


Review request for mahout, Ted Dunning, lancenorskog, and Grant Ingersoll.


Changes
-------

Modified to accept any Writable as the value (instead of just VectorWritable). This still requires the generic class PairWritable to be extended for each value class of interest so that the extended class can be passed into setMapOutputValueClass(). I'm not sure this is the best approach; any suggestions would be appreciated!
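
A short sketch of why a concrete subclass per value type is needed, using plain Java stand-ins rather than the Hadoop or Mahout classes. Pair, IntStringPair and ClassTokenDemo are hypothetical names. The assumption is that the framework receives only a Class object and creates instances reflectively. Java erases generic parameters, so there is no class literal for Pair<String>, while a concrete subclass keeps the value type in its class metadata.

```java
import java.lang.reflect.ParameterizedType;

/** Generic (index, value) holder; stands in for a generic pair writable. */
class Pair<V> {
  int index;
  V value;
}

/** Concrete subclass: its Class object pins the value type to String. */
final class IntStringPair extends Pair<String> {}

final class ClassTokenDemo {
  private ClassTokenDemo() {}

  /** Mimics a framework creating a value object from nothing but a class token. */
  static <T> T instantiate(Class<T> type) throws ReflectiveOperationException {
    return type.getDeclaredConstructor().newInstance();
  }

  /** Recovers the value type that a concrete Pair subclass was declared with. */
  static Class<?> valueTypeOf(Class<?> concretePairClass) {
    ParameterizedType parent =
        (ParameterizedType) concretePairClass.getGenericSuperclass();
    return (Class<?>) parent.getActualTypeArguments()[0];
  }
}
```

Passing IntStringPair.class works because the subclass has a no-argument constructor and a fixed value type. Passing Pair.class alone would leave the value type unknown when the object has to be rebuilt from bytes. One alternative, at the cost of a larger serialized form, is to write the value's class name into the record.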


This addresses bug MAHOUT-904.
    https://issues.apache.org/jira/browse/MAHOUT-904


Diffs (updated)
-----

  /trunk/integration/src/main/java/org/apache/mahout/utils/RandomPermuteJob.java PRE-CREATION 
  /trunk/integration/src/main/java/org/apache/mahout/utils/PairWritable.java PRE-CREATION 
  /trunk/integration/src/main/java/org/apache/mahout/utils/IntVectorWritable.java PRE-CREATION 
  /trunk/integration/src/test/java/org/apache/mahout/utils/TestRandomPermuteJob.java PRE-CREATION 

Diff: https://reviews.apache.org/r/3092/diff


Testing
-------


Thanks,

Raphael