
Posted to dev@mahout.apache.org by "Karl Wettin (JIRA)" <ji...@apache.org> on 2008/06/21 02:25:45 UTC

[jira] Created: (MAHOUT-61) Text classification matrix

Text classification matrix 
---------------------------

                 Key: MAHOUT-61
                 URL: https://issues.apache.org/jira/browse/MAHOUT-61
             Project: Mahout
          Issue Type: New Feature
            Reporter: Karl Wettin
            Priority: Minor




-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

[jira] Commented: (MAHOUT-61) Text problem matrix builder

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-61?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12611763#action_12611763 ] 

Karl Wettin commented on MAHOUT-61:
-----------------------------------

The patch requires Lucene 2.4-dev core, analyzers and snowball. You have to build them from the Lucene trunk, though I could attach binaries here too.

The TwentyNewsGroups class is what I've used to test the package. It requires 20news-bydate to be extracted to DFS.

http://people.csail.mit.edu/jrennie/20Newsgroups/20news-bydate.tar.gz

I've only been running this in local mode from my development environment:

{code:java}
public class TwentyNewsGroups extends InstanceHandler {
  public static void main(String[] args) throws Exception {
    // Arguments are passed to the driver as key=value strings.
    TokenMatrixBuilderDriver.main(new String[]{
        "instanceHandlerClass=" + TwentyNewsGroups.class.getName(),
        "dfsRootPath=20news",
        "instancesInputPath=20news/20news-bydate/20news-bydate-train"
    });
  }
  // remaining members omitted
}
{code}
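
The driver arguments in the snippet above are plain key=value strings. How TokenMatrixBuilderDriver actually reads them is not shown in this thread; the following is only a minimal sketch of one way such arguments could be parsed. The class name KeyValueArgs and its parse method are hypothetical and not part of the patch.

```java
import java.util.LinkedHashMap;
import java.util.Map;

/** Hypothetical helper; the real driver's argument handling is not shown in this thread. */
class KeyValueArgs {

  /** Splits each "key=value" argument on the first '=' and collects the pairs in order. */
  static Map<String, String> parse(String[] args) {
    Map<String, String> settings = new LinkedHashMap<>();
    for (String arg : args) {
      int eq = arg.indexOf('=');
      if (eq <= 0) {
        // No '=' at all, or an empty key: reject rather than guess.
        throw new IllegalArgumentException("Expected key=value but got: " + arg);
      }
      settings.put(arg.substring(0, eq), arg.substring(eq + 1));
    }
    return settings;
  }
}
```

Splitting on the first '=' only means that values may themselves contain '=' characters, which matters for paths or class names that are passed through unchanged.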


> Text problem matrix builder 
> ----------------------------
>
>                 Key: MAHOUT-61
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-61
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Karl Wettin
>            Assignee: Karl Wettin
>            Priority: Minor
>         Attachments: MAHOUT-61.txt, MAHOUT-61.txt, MAHOUT-61.txt
>
>
> A set of classes that builds matrices from text.
> Currently the API consists of TokenMatrixBuilder and TokenInstanceBuilder. Should be thread safe.
> PostReader imports 20news-bydate. This takes several GB heap. It would be nice to bounce the data via JDBM or perhaps using the PersistentHashMap in MAHOUT-19.


[jira] Commented: (MAHOUT-61) Text problem matrix builder

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-61?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12611732#action_12611732 ] 

Karl Wettin commented on MAHOUT-61:
-----------------------------------

I suppose the next step is to pass the data on to some algorithm. I'm going to start with MAHOUT-19.


[jira] Issue Comment Edited: (MAHOUT-61) Text problem matrix builder

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-61?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12608839#action_12608839 ] 

karl.wettin edited comment on MAHOUT-61 at 6/27/08 10:00 AM:
-------------------------------------------------------------

M/R version of previous patch. 

So far it only compiles. I'll be replacing the TODOs with code soon enough. It is still Maven only!

One thing I'm not quite certain about is how to handle features that are class values, for instance the newsgroup when parsing 20NewsGroups.

Comments most appreciated. 


[jira] Updated: (MAHOUT-61) Text problem matrix builder

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-61?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wettin updated MAHOUT-61:
------------------------------

    Attachment: MAHOUT-61.txt

This is what it is now:

 1. InstanceHandler gathers instances
 2. TokenizationMapper, Reducer and Combiner create one intermediate MapWritable instance (see [4]). These are reduced down to unique feature names and class values.
 3. The features and class values are placed in maps, assigned column indices and numeric values, and stored as a MapFile on DFS.
 4. VectorBuilderMapper is a map-only job that uses the results from [2] and [3] to produce sparse vectors.
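
Steps 3 and 4 above can be illustrated in memory, without Hadoop, roughly as follows. This is a minimal sketch and not the code from the patch: the class name MatrixSketch, its method names, and the use of plain java.util maps in place of MapWritable and MapFile are all assumptions made for the example.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.TreeMap;

/** Hypothetical in-memory stand-in for steps 3 and 4; not the code from the patch. */
class MatrixSketch {

  /**
   * Step 3: gives each distinct name (a feature name or a class value)
   * the next free integer, in first-seen order.
   */
  static Map<String, Integer> assignIds(Iterable<String> names) {
    Map<String, Integer> ids = new LinkedHashMap<>();
    for (String name : names) {
      // size() is read before the insert, so ids run 0, 1, 2, ...
      ids.putIfAbsent(name, ids.size());
    }
    return ids;
  }

  /**
   * Step 4: turns the token weights of one instance into a sparse vector
   * keyed by column index. Tokens without a column are dropped.
   */
  static Map<Integer, Double> toSparseVector(Map<String, Double> tokenWeights,
                                             Map<String, Integer> columns) {
    Map<Integer, Double> vector = new TreeMap<>();
    for (Map.Entry<String, Double> e : tokenWeights.entrySet()) {
      Integer column = columns.get(e.getKey());
      if (column != null) {
        vector.put(column, e.getValue());
      }
    }
    return vector;
  }
}
```

The same assignIds call could serve for both feature names and class values, which is one possible reading of "assigned column indices and numeric values" in step 3.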



[jira] Updated: (MAHOUT-61) Text problem matrix builder

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-61?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wettin updated MAHOUT-61:
------------------------------

       Assignee: Karl Wettin
    Description: 
A set of classes that builds matrices from text.

Currently the API consists of TokenMatrixBuilder and TokenInstanceBuilder. Should be thread safe.

PostReader imports 20news-bydate. This takes several GB heap. It would be nice to bounce the data via JDBM or perhaps using the PersistentHashMap in MAHOUT-19.
        Summary: Text problem matrix builder   (was: Text classification matrix )

Oops, I hit enter a bit too early.


[jira] Commented: (MAHOUT-61) Text problem matrix builder

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-61?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12607008#action_12607008 ] 

Karl Wettin commented on MAHOUT-61:
-----------------------------------

It just hit me that this should of course be an M/R job. I suppose it would have to be divided into two runs: one that extracts the string features, and one that maps each string feature to a matrix column index. Or?



[jira] Commented: (MAHOUT-61) Text problem matrix builder

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-61?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12606947#action_12606947 ] 

Karl Wettin commented on MAHOUT-61:
-----------------------------------

Tokenization is currently geared more toward classification problems than clustering problems. I wanted to add shingles but could not find the class in the Lucene distributions, not even in a snapshot.

So far this code just creates a matrix. It needs to be written to a file so it can be read by the algorithms that want to use it. I have not really tested this; it is an early beta just to get some feedback.


[jira] Issue Comment Edited: (MAHOUT-61) Text problem matrix builder

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-61?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12608463#action_12608463 ] 

karl.wettin edited comment on MAHOUT-61 at 6/26/08 8:10 AM:
------------------------------------------------------------

I've started M/R:ing this now. More or less WordCount++, but I don't think my implementation is that nice.

This is what it is:

{code:java}

/** Parses and tokenizes an instance, maps tokens to weights, and also stores a MapWritable<Text, DoubleWritable> on the file system. */
class TokenizationMapper implements
    Mapper</** instance identity */ LongWritable, /** instance path */ Path, /** token */ Text, /** weight */ DoubleWritable> { /* body omitted */ }

/** Sums all weights per token. */
class TokenizationReducer implements
    Reducer</** token */ Text, /** weight */ DoubleWritable, /** token */ Text, /** weight */ DoubleWritable> { /* body omitted */ }

/** Sets up the instance parser and tokenizer,
     runs TokenizationMapper/TokenizationReducer,
     creates a MapWritable<Text, IntWritable> with column indices, sorted by token frequency,
     and then runs VectorBuilderMapper (no reducer) to produce the final results. */
class TokenMatrixBuilderDriver { /* body omitted */ }


/** Reads the MapWritable<Text, DoubleWritable> created by TokenizationMapper
     and maps the values to a Vector using the MapWritable<Text, IntWritable> produced by TokenMatrixBuilderDriver. */
class VectorBuilderMapper implements
    Mapper</** instance identity */ LongWritable, /** instance path */ Path, /** instance identity */ LongWritable, Vector> { /* body omitted */ }

{code}

Is there a better way to do this? In particular, I don't like how I store the token vector MapWritable in TokenizationMapper and open it up again in VectorBuilderMapper.

I never really use the "instance identity" key.

I think the tokenization reducer should count the number of instances a feature occurs in rather than summing the weights, or perhaps there should be a setting to control this. The thought is that it could be used as an initial, crude feature selection scheme.
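
To make the difference between the two reducer behaviours concrete, here is a minimal in-memory sketch. It is not the reducer from the patch: the class name FeatureStats, its methods, and the minInstances threshold are hypothetical, and plain java.util collections stand in for the Hadoop types.

```java
import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Hypothetical sketch of the two reducer behaviours discussed above; names are illustrative. */
class FeatureStats {

  /** Sums all weights per token, as TokenizationReducer is described as doing. */
  static Map<String, Double> sumWeights(List<Map<String, Double>> instances) {
    Map<String, Double> sums = new HashMap<>();
    for (Map<String, Double> instance : instances) {
      for (Map.Entry<String, Double> e : instance.entrySet()) {
        sums.merge(e.getKey(), e.getValue(), Double::sum);
      }
    }
    return sums;
  }

  /** Counts how many instances each token occurs in, ignoring the weights. */
  static Map<String, Integer> countInstances(List<Map<String, Double>> instances) {
    Map<String, Integer> counts = new HashMap<>();
    for (Map<String, Double> instance : instances) {
      for (String token : instance.keySet()) {
        counts.merge(token, 1, Integer::sum);
      }
    }
    return counts;
  }

  /** Crude feature selection: keeps only tokens seen in at least minInstances instances. */
  static Map<String, Integer> select(Map<String, Integer> counts, int minInstances) {
    Map<String, Integer> kept = new HashMap<>();
    for (Map.Entry<String, Integer> e : counts.entrySet()) {
      if (e.getValue() >= minInstances) {
        kept.put(e.getKey(), e.getValue());
      }
    }
    return kept;
  }
}
```

With summed weights, a token that is repeated many times in a single instance can look as important as one that appears once in many instances; the instance count separates the two cases, which is why it may be the more useful basis for a first filter.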


[jira] Commented: (MAHOUT-61) Text problem matrix builder

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-61?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789516#action_12789516 ] 

Sean Owen commented on MAHOUT-61:
---------------------------------

Same, is this still relevant? Looks kind of like MAHOUT-116, which Ted said has been superseded, and the related MAHOUT-19 seems defunct.


[jira] Commented: (MAHOUT-61) Text problem matrix builder

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-61?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789529#action_12789529 ] 

Ted Dunning commented on MAHOUT-61:
-----------------------------------


I take it back... this looks slightly useful.  

Grant, or Karl (if you still exist) can you comment on this?




[jira] Commented: (MAHOUT-61) Text problem matrix builder

Posted by "Ted Dunning (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-61?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12789528#action_12789528 ] 

Ted Dunning commented on MAHOUT-61:
-----------------------------------


I think superseded.



[jira] Updated: (MAHOUT-61) Text problem matrix builder

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-61?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wettin updated MAHOUT-61:
------------------------------

    Attachment: MAHOUT-61.txt

M/R version of previous patch. 

So far it only compiles. I'll be replacing the TODOs with code soon enough.

One thing I'm not quite certain about is how to handle features that are class values, for instance the newsgroup when parsing 20NewsGroups.

Comments most appreciated. 


[jira] Commented: (MAHOUT-61) Text problem matrix builder

Posted by "Otis Gospodnetic (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-61?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12606981#action_12606981 ] 

Otis Gospodnetic commented on MAHOUT-61:
----------------------------------------

Re shingles, see LUCENE-400 -- lives in contrib with analyzers
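
For readers unfamiliar with the term, a shingle is a token n-gram: a run of consecutive tokens joined into a single feature. The sketch below illustrates the idea in plain Java. It does not use or reproduce the Lucene filter from LUCENE-400, and the class name Shingles and its method are made up for the example.

```java
import java.util.ArrayList;
import java.util.List;

/** Hypothetical illustration of word shingles; not the Lucene implementation. */
class Shingles {

  /** Returns every run of {@code size} consecutive tokens, joined with a single space. */
  static List<String> of(List<String> tokens, int size) {
    if (size < 1) {
      throw new IllegalArgumentException("size must be at least 1");
    }
    List<String> shingles = new ArrayList<>();
    // The loop stops once fewer than `size` tokens remain.
    for (int i = 0; i + size <= tokens.size(); i++) {
      shingles.add(String.join(" ", tokens.subList(i, i + size)));
    }
    return shingles;
  }
}
```

Calling Shingles.of with the tokens "text", "problem", "matrix", "builder" and a size of 2 yields "text problem", "problem matrix" and "matrix builder"; each of these could then be treated as one feature alongside the single tokens.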



[jira] Commented: (MAHOUT-61) Text problem matrix builder

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-61?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12608463#action_12608463 ] 

Karl Wettin commented on MAHOUT-61:
-----------------------------------

I've started M/R:ing this now. More or less WordCount++, but I don't think my implementation is that nice.

This is what it is:

{code:java}

/** Parses and tokenizes an instance, maps tokens to weights, and also stores a MapWritable<Text, DoubleWritable> on the file system. */
class TokenizationMapper implements
    Mapper</** instance identity */ LongWritable, /** instance path */ Path, /** token */ Text, /** weight */ DoubleWritable> { /* body omitted */ }

/** Sums all weights per token. */
class TokenizationReducer implements
    Reducer</** token */ Text, /** weight */ DoubleWritable, /** token */ Text, /** weight */ DoubleWritable> { /* body omitted */ }

/** Sets up the instance parser and tokenizer,
     runs TokenizationMapper/TokenizationReducer,
     creates a MapWritable<Text, IntWritable> with column indices, sorted by token frequency,
     and then runs VectorBuilderMapper (no reducer) to produce the final results. */
class TokenMatrixBuilderDriver { /* body omitted */ }


/** Reads the MapWritable<Text, DoubleWritable> created by TokenizationMapper
     and maps the values to a Vector using the MapWritable<Text, IntWritable> produced by TokenMatrixBuilderDriver. */
class VectorBuilderMapper implements
    Mapper</** instance identity */ LongWritable, /** instance path */ Path, /** instance identity */ LongWritable, Vector> { /* body omitted */ }

{code}

Is there a better way to do this? In particular, I don't like how I store the token vector MapWritable in TokenizationMapper and open it up again in VectorBuilderMapper.

I never really use the "instance identity" key.

I think the tokenization reducer should count the number of instances a feature occurs in rather than summing the weights, or perhaps there should be a setting to control this. The thought is that it could be used as an initial, crude feature selection scheme.


[jira] Updated: (MAHOUT-61) Text problem matrix builder

Posted by "Karl Wettin (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-61?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Karl Wettin updated MAHOUT-61:
------------------------------

    Attachment: MAHOUT-61.txt

Created as mvn module examples. No Ant stuff yet. 

PostReader requires trunk/examples/resources/20news-bydate




[jira] Resolved: (MAHOUT-61) Text problem matrix builder

Posted by "Sean Owen (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-61?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Sean Owen resolved MAHOUT-61.
-----------------------------

    Resolution: Later

Given lack of action, going to shelve this.
