Posted to dev@mahout.apache.org by "Drew Farris (JIRA)" <ji...@apache.org> on 2010/08/09 04:16:16 UTC

[jira] Updated: (MAHOUT-451) Simple utility to split bayes input into training/test sets

     [ https://issues.apache.org/jira/browse/MAHOUT-451?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Drew Farris updated MAHOUT-451:
-------------------------------

    Attachment: MAHOUT-451.patch

Thanks for the feedback, Robin. Here's an updated patch that supports several different ways of selecting a test set and fixes the issue with OptionException not being printed. The class is now instance-based instead of being a collection of static methods.

The composition of the test set is determined using one of the following approaches (setters also exposed via command-line arguments):

A contiguous set of items can be chosen from the input file(s) using the setTestSplitSize(int) or setTestSplitPct(int) methods. setTestSplitSize(int) allocates a fixed number of items, while setTestSplitPct(int) allocates a percentage of the original input, rounded up to the nearest integer. setSplitLocation(int) is used to control the position in the input from which the test data is extracted and is described in command-line help and the javadoc.
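
As a rough illustration of the contiguous approach, here is a minimal sketch. It is not the code in the patch: the class ContiguousSplitSketch, its method names, and the plain item-index "start" parameter are all hypothetical, and the actual semantics of setSplitLocation(int) are those given in the javadoc. It only shows the round-up percentage arithmetic and the extraction of one contiguous block.

```java
import java.util.ArrayList;
import java.util.List;

// Hypothetical illustration only; not the class from MAHOUT-451.patch.
class ContiguousSplitSketch {

  // Test-set size as a percentage of the input, rounded up to the nearest integer.
  // Integer arithmetic avoids floating-point rounding surprises.
  static int testSizeFromPct(int inputSize, int pct) {
    return (int) (((long) inputSize * pct + 99) / 100);
  }

  // Returns [testItems, trainingItems]. The test items are one contiguous
  // block beginning at 'start' (assumed here to be an item index); the
  // training items are everything before and after that block.
  static <T> List<List<T>> split(List<T> items, int start, int testSize) {
    int size = Math.max(0, Math.min(testSize, items.size()));
    // Clamp the start so the block always fits inside the input.
    int from = Math.max(0, Math.min(start, items.size() - size));

    List<T> test = new ArrayList<>(items.subList(from, from + size));
    List<T> training = new ArrayList<>(items.subList(0, from));
    training.addAll(items.subList(from + size, items.size()));

    List<List<T>> result = new ArrayList<>();
    result.add(test);
    result.add(training);
    return result;
  }
}
```

With ten input items and a 25 percent test split, testSizeFromPct(10, 25) gives 3, so split(items, 4, 3) places items 4 through 6 in the test set and the remaining seven in the training set.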

A random sampling of items can be chosen from the input file(s) using the setTestRandomSelectSize(int) or setTestRandomSelectionPct(int) methods, which choose a fixed test set size or a percentage of the input set size, respectively, as described above. The RandomSampler class from mahout-math is used to create a sample of the appropriate size.
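
The random approach can be sketched in the same spirit. This example deliberately does not call RandomSampler; it substitutes a simple selection-sampling loop built on java.util.Random so that it stands alone. The class RandomSplitSketch and its method names are hypothetical and are not taken from the patch.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Random;

// Hypothetical illustration only; a stand-in for the RandomSampler-based code.
class RandomSplitSketch {

  // Marks exactly min(testSize, inputSize) positions as test items.
  // Each position is picked with probability needed / remaining, which
  // yields a uniformly random subset of the requested size.
  static boolean[] chooseTestIndices(int inputSize, int testSize, Random rng) {
    boolean[] inTest = new boolean[inputSize];
    int needed = Math.min(testSize, inputSize);
    for (int i = 0; i < inputSize && needed > 0; i++) {
      int remaining = inputSize - i;
      if (rng.nextInt(remaining) < needed) {
        inTest[i] = true;
        needed--;
      }
    }
    return inTest;
  }

  // Returns [testItems, trainingItems], preserving the input order in both.
  static <T> List<List<T>> split(List<T> items, int testSize, Random rng) {
    boolean[] inTest = chooseTestIndices(items.size(), testSize, rng);
    List<T> test = new ArrayList<>();
    List<T> training = new ArrayList<>();
    for (int i = 0; i < items.size(); i++) {
      if (inTest[i]) {
        test.add(items.get(i));
      } else {
        training.add(items.get(i));
      }
    }
    List<List<T>> result = new ArrayList<>();
    result.add(test);
    result.add(training);
    return result;
  }
}
```

Passing a seeded Random makes the split reproducible, which is convenient when comparing classifier runs on the same training and test sets.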




> Simple utility to split bayes input into training/test sets
> -----------------------------------------------------------
>
>                 Key: MAHOUT-451
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-451
>             Project: Mahout
>          Issue Type: New Feature
>          Components: Classification
>    Affects Versions: 0.3
>            Reporter: Drew Farris
>            Priority: Minor
>         Attachments: MAHOUT-451.patch, MAHOUT-451.patch
>
>
> Provides a simple utility that you point at a directory containing files in Bayes classifier input format. Given the number of documents to write to the test set, for each input file it will produce files in two output directories, one containing training data with the test documents removed and a second containing the test documents.
