Posted to dev@mahout.apache.org by "Shashikant Kore (JIRA)" <ji...@apache.org> on 2009/05/29 09:52:45 UTC

[jira] Created: (MAHOUT-126) Prepare document vectors from the text

Prepare document vectors from the text
--------------------------------------

                 Key: MAHOUT-126
                 URL: https://issues.apache.org/jira/browse/MAHOUT-126
             Project: Mahout
          Issue Type: New Feature
            Reporter: Shashikant Kore


Clustering algorithms presently take document vectors as input. Generating these document vectors from the text can be broken into two tasks.

1. Create a Lucene index of the input plain-text documents.
2. From the index, generate the (sparse) document vectors, with the TF-IDF value of each term as its weight. With a Lucene index, this value can be calculated very easily.

Presently, I have created two separate utilities, which could possibly be invoked from another class.
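
The two tasks above can be sketched without Lucene. The following is a minimal, hypothetical illustration rather than the attached utilities: an in-memory list of tokenized documents stands in for the index, and the weight is the textbook tf * ln(N / df) form of TF-IDF, which is an assumption (the issue does not say which variant is used). All class and method names are invented for the example.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.HashSet;
import java.util.List;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch: an in-memory corpus stands in for the Lucene index.
final class TfIdfVectorSketch {

    // Task 1 stand-in: for each term, count the documents that contain it.
    static Map<String, Integer> documentFrequencies(List<List<String>> docs) {
        Map<String, Integer> docFreqs = new HashMap<>();
        for (List<String> doc : docs) {
            for (String term : new HashSet<>(doc)) { // each term counted once per document
                docFreqs.merge(term, 1, Integer::sum);
            }
        }
        return docFreqs;
    }

    // Task 2: sparse vector (term -> weight) with weight = tf * ln(numDocs / df).
    // Every term of doc must be present in docFreqs.
    static Map<String, Double> vectorize(List<String> doc, Map<String, Integer> docFreqs, int numDocs) {
        Map<String, Integer> termFreqs = new HashMap<>();
        for (String term : doc) {
            termFreqs.merge(term, 1, Integer::sum);
        }
        Map<String, Double> vector = new TreeMap<>();
        for (Map.Entry<String, Integer> entry : termFreqs.entrySet()) {
            double idf = Math.log((double) numDocs / docFreqs.get(entry.getKey()));
            double weight = entry.getValue() * idf;
            if (weight != 0.0) { // keep the vector sparse: zero weights are not stored
                vector.put(entry.getKey(), weight);
            }
        }
        return vector;
    }

    public static void main(String[] args) {
        List<List<String>> docs = Arrays.asList(
            Arrays.asList("apple", "banana", "apple"),
            Arrays.asList("banana", "cherry"));
        Map<String, Integer> docFreqs = documentFrequencies(docs);
        for (List<String> doc : docs) {
            System.out.println(vectorize(doc, docFreqs, docs.size()));
        }
    }
}
```

In this sketch, a term that occurs in every document gets a weight of zero and is left out of the vector, which is one simple way to keep the representation sparse.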



-- 
This message is automatically generated by JIRA.
-
You can reply to this email to add a comment to the issue online.

Re: [jira] Updated: (MAHOUT-126) Prepare document vectors from the text

Posted by David Hall <dl...@cs.stanford.edu>.

Ignore this. Wrong issue.

On Fri, Jun 19, 2009 at 12:59 AM, David Hall (JIRA)<ji...@apache.org> wrote:
>
>     [ https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]
>
> David Hall updated MAHOUT-126:
> ------------------------------
>
>    Attachment: MAHOUT-123.patch
>
> Ok, I'm going to call this a mostly functional patch.
>

[jira] Commented: (MAHOUT-126) Prepare document vectors from the text

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721351#action_12721351 ] 

Grant Ingersoll commented on MAHOUT-126:
----------------------------------------

Yep, you are right.  I committed your patch anyway.  We should probably add command-line options for setting minDF and maxDF.
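
One way to read minDF and maxDF is as bounds on how many documents a term may appear in before it is dropped from the vectors. A minimal sketch under that assumption follows; treating both as absolute document counts is a guess (the actual option could just as well be a percentage), and the class and method names are hypothetical.

```java
import java.util.HashMap;
import java.util.Map;
import java.util.TreeMap;

// Hypothetical sketch of document-frequency filtering; not the Driver's actual code.
final class DocFreqFilterSketch {

    // Keeps only the terms whose document frequency lies in [minDF, maxDF].
    static Map<String, Integer> filter(Map<String, Integer> docFreqs, int minDF, int maxDF) {
        Map<String, Integer> kept = new TreeMap<>();
        for (Map.Entry<String, Integer> entry : docFreqs.entrySet()) {
            int df = entry.getValue();
            if (df >= minDF && df <= maxDF) {
                kept.put(entry.getKey(), df);
            }
        }
        return kept;
    }

    public static void main(String[] args) {
        Map<String, Integer> docFreqs = new HashMap<>();
        docFreqs.put("the", 100);   // too common
        docFreqs.put("mahout", 5);  // kept
        docFreqs.put("typo", 1);    // too rare
        System.out.println(filter(docFreqs, 2, 50));
    }
}
```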

> Prepare document vectors from the text
> --------------------------------------
>
>                 Key: MAHOUT-126
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-126
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.2
>
>         Attachments: mahout-126-benson.patch, MAHOUT-126-no-normalization.patch, MAHOUT-126-no-normalization.patch, MAHOUT-126-null-entry.patch, MAHOUT-126-TF.patch, MAHOUT-126.patch, MAHOUT-126.patch, MAHOUT-126.patch, MAHOUT-126.patch
>
>
> Clustering algorithms presently take the document vectors as input.  Generating these document vectors from the text can be broken in two tasks. 
> 1. Create lucene index of the input  plain-text documents 
> 2. From the index, generate the document vectors (sparse) with weights as TF-IDF values of the term. With lucene index, this value can be calculated very easily. 
> Presently, I have created two separate utilities, which could possibly be invoked from another class. 


[jira] Commented: (MAHOUT-126) Prepare document vectors from the text

Posted by "David Hall (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714362#action_12714362 ] 

David Hall commented on MAHOUT-126:
-----------------------------------

Sure, I just want to be able to have:

double weight = similarity.tf(termFreq) * similarity.idf(docFreq, numDocs);

be this instead:

double weight = termFreq;

based on some configuration or other. (Maybe I could just pass in a custom "Similarity" object? Or there could be a protected method "createSimilarity" that I could override?)

Basically, LDA wants raw counts (or at least, some kind of integers).
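
The configurable weighting asked for here could look roughly like the sketch below. It is hypothetical: the TermWeight interface and its two implementations are invented names, not the patch's API, and the TF-IDF formula (square root of the term frequency times ln(numDocs / (docFreq + 1)) + 1) is only modelled on Lucene's classic default similarity, as an assumption about what similarity.tf and similarity.idf compute.

```java
// Hypothetical strategy interface; the name and signature are assumptions.
interface TermWeight {
    double calculate(int termFreq, int docFreq, int numDocs);
}

final class TermWeightSketch {

    // Raw term counts, which is what LDA wants.
    static final TermWeight TF = (termFreq, docFreq, numDocs) -> termFreq;

    // TF-IDF modelled on Lucene's classic default similarity (assumed formulas).
    static final TermWeight TF_IDF = (termFreq, docFreq, numDocs) ->
        Math.sqrt(termFreq) * (Math.log(numDocs / (double) (docFreq + 1)) + 1.0);

    // Applies the chosen weight to parallel arrays of term and document frequencies.
    static double[] weigh(int[] termFreqs, int[] docFreqs, int numDocs, TermWeight weight) {
        double[] weights = new double[termFreqs.length];
        for (int i = 0; i < termFreqs.length; i++) {
            weights[i] = weight.calculate(termFreqs[i], docFreqs[i], numDocs);
        }
        return weights;
    }

    public static void main(String[] args) {
        int[] termFreqs = {3, 1};
        int[] docFreqs = {10, 2};
        System.out.println(java.util.Arrays.toString(weigh(termFreqs, docFreqs, 100, TF)));
        System.out.println(java.util.Arrays.toString(weigh(termFreqs, docFreqs, 100, TF_IDF)));
    }
}
```

Passing the weight in as an object keeps the vector-building loop unchanged; only the strategy differs between clustering and LDA.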

Thanks!



[jira] Updated: (MAHOUT-126) Prepare document vectors from the text

Posted by "David Hall (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Hall updated MAHOUT-126:
------------------------------

    Attachment: MAHOUT-126-no-normalization.patch

OK, here's the patch for normalization. The other one is forthcoming.

Also, I'm getting null pointers out of VectorMapper after building an index using Lucene's demo indexer. I'm going to follow the Solr instructions and see if I have better luck.


[jira] Updated: (MAHOUT-126) Prepare document vectors from the text

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated MAHOUT-126:
-----------------------------------

    Attachment: MAHOUT-126.patch

Here's a first attempt at my thoughts based on the two previous patches, plus some other ideas.

The main gist of the idea centers around the VectorIterable interface and is driven by the o.a.mahout.utils.vectors.Driver class.

Note, I dropped the Lucene indexing part, as I don't think we need to be in the game of creating Lucene indexes.  That is a well-known and well-documented process that is available elsewhere.  In fact, for this particular piece, I indexed Wikipedia in Solr and then pointed the Driver class at the Lucene index.

See http://cwiki.apache.org/confluence/display/MAHOUT/Creating+Vectors+from+Text for usage details.


[jira] Commented: (MAHOUT-126) Prepare document vectors from the text

Posted by "David Hall (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720816#action_12720816 ] 

David Hall commented on MAHOUT-126:
-----------------------------------

LuceneIteratable (is that an intentional pun?) has behavior that isn't documented well. Namely, if the normless constructor is called, the norm defaults to 2.

As a consequence, running Driver without passing in a norm L2-normalizes the vectors. You have to specify a negative double != -1.0 to get unnormalized counts. Relatedly, -1 maps to the L2 norm. This seems like odd behavior to me, or at least it should be documented. (The wiki article implies there's a difference between using --norm 2 and using no norm at all.)

Also, I'd like an option to tell Driver what weight object to use. I can do the patch for this.

Thanks!


[jira] Commented: (MAHOUT-126) Prepare document vectors from the text

Posted by "Benson Margulies (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714426#action_12714426 ] 

Benson Margulies commented on MAHOUT-126:
-----------------------------------------

This patch needs to explicitly manage the character set of the files it is reading. It uses FileReader without specifying a character set.
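
As a minimal sketch of the fix being asked for, the reader below takes the character set as an explicit argument instead of relying on the platform default that FileReader uses. The class and method names are hypothetical, and UTF-8 is only an example choice; the patch itself may handle this differently.

```java
import java.io.BufferedReader;
import java.io.FileInputStream;
import java.io.IOException;
import java.io.InputStreamReader;
import java.io.UncheckedIOException;
import java.nio.charset.Charset;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Path;

// Hypothetical sketch: read a text file with an explicitly chosen character set.
final class CharsetReaderSketch {

    // Reads the whole file, decoding its bytes with the given charset.
    static String readAll(Path file, Charset charset) {
        StringBuilder text = new StringBuilder();
        try (BufferedReader reader = new BufferedReader(
                new InputStreamReader(new FileInputStream(file.toFile()), charset))) {
            char[] buffer = new char[4096];
            int read;
            while ((read = reader.read(buffer)) != -1) {
                text.append(buffer, 0, read);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return text.toString();
    }

    // Demo helper: writes content to a temporary UTF-8 file and reads it back.
    static String roundTrip(String content) {
        try {
            Path tmp = Files.createTempFile("charset-sketch", ".txt");
            try {
                Files.write(tmp, content.getBytes(StandardCharsets.UTF_8));
                return readAll(tmp, StandardCharsets.UTF_8);
            } finally {
                Files.deleteIfExists(tmp);
            }
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
    }

    public static void main(String[] args) {
        // Non-ASCII characters are written as escapes so the source file encoding does not matter.
        System.out.println(roundTrip("na\u00EFve caf\u00E9"));
    }
}
```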



[jira] Assigned: (MAHOUT-126) Prepare document vectors from the text

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll reassigned MAHOUT-126:
--------------------------------------

    Assignee: Grant Ingersoll


[jira] Commented: (MAHOUT-126) Prepare document vectors from the text

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720824#action_12720824 ] 

Grant Ingersoll commented on MAHOUT-126:
----------------------------------------

Agreed about the weirdness on the default norms.  Yeah, patch would be great. 


[jira] Updated: (MAHOUT-126) Prepare document vectors from the text

Posted by "David Hall (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Hall updated MAHOUT-126:
------------------------------

    Attachment:     (was: MAHOUT-123.patch)


[jira] Updated: (MAHOUT-126) Prepare document vectors from the text

Posted by "David Hall (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Hall updated MAHOUT-126:
------------------------------

    Attachment: MAHOUT-126-null-entry.patch

I'm going to assume that's the problem. The attached patch just skips over any null term vectors. It seems like reasonable behavior here, given the filtering.
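
Skipping null entries can be done with a small wrapping iterator. The sketch below is generic Java, not the attached patch: NullSkippingIterator and drain are hypothetical names, and the strings in main merely stand in for per-document term vectors, where null marks a document that has none.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.Iterator;
import java.util.List;
import java.util.NoSuchElementException;

// Hypothetical sketch: an iterator that silently skips null elements of its source.
final class NullSkippingIterator<T> implements Iterator<T> {
    private final Iterator<T> source;
    private T next; // next non-null element, or null when the source is exhausted

    NullSkippingIterator(Iterator<T> source) {
        this.source = source;
        advance();
    }

    // Moves to the next non-null element of the source, if there is one.
    private void advance() {
        next = null;
        while (source.hasNext()) {
            T candidate = source.next();
            if (candidate != null) {
                next = candidate;
                return;
            }
        }
    }

    @Override
    public boolean hasNext() {
        return next != null;
    }

    @Override
    public T next() {
        if (next == null) {
            throw new NoSuchElementException();
        }
        T result = next;
        advance();
        return result;
    }

    // Convenience helper that collects the remaining elements into a list.
    static <T> List<T> drain(Iterator<T> iterator) {
        List<T> items = new ArrayList<>();
        while (iterator.hasNext()) {
            items.add(iterator.next());
        }
        return items;
    }

    public static void main(String[] args) {
        List<String> termVectors = Arrays.asList("doc0", null, "doc2", null);
        System.out.println(drain(new NullSkippingIterator<String>(termVectors.iterator())));
    }
}
```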




[jira] Commented: (MAHOUT-126) Prepare document vectors from the text

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714414#action_12714414 ] 

Grant Ingersoll commented on MAHOUT-126:
----------------------------------------

See SOLR-1193.  


[jira] Commented: (MAHOUT-126) Prepare document vectors from the text

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714401#action_12714401 ] 

Grant Ingersoll commented on MAHOUT-126:
----------------------------------------

Providing a way to pass in a custom weight object makes sense.


[jira] Updated: (MAHOUT-126) Prepare document vectors from the text

Posted by "David Hall (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Hall updated MAHOUT-126:
------------------------------

    Attachment: MAHOUT-123.patch

Ok, I'm going to call this a mostly functional patch.


[jira] Commented: (MAHOUT-126) Prepare document vectors from the text

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720686#action_12720686 ] 

Grant Ingersoll commented on MAHOUT-126:
----------------------------------------

Committed revision 785618.


[jira] Updated: (MAHOUT-126) Prepare document vectors from the text

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated MAHOUT-126:
-----------------------------------

    Attachment: MAHOUT-126.patch

Updated patch since MAHOUT-65-name.patch was committed.


[jira] Commented: (MAHOUT-126) Prepare document vectors from the text

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720880#action_12720880 ] 

Grant Ingersoll commented on MAHOUT-126:
----------------------------------------

> Also, I'm getting null pointers out of VectorMapper after building an index using Lucene's demo indexer. I'm going to follow the solr instructions and see if I have better luck.

This stuff requires Term Vectors to be enabled in the Lucene index.


[jira] Commented: (MAHOUT-126) Prepare document vectors from the text

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720931#action_12720931 ] 

Grant Ingersoll commented on MAHOUT-126:
----------------------------------------

I committed the no-norm patch with some slight mods: checking only for NO_NORMALIZATION is not enough, since any other value < 0 is not valid.
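
The validation described here might be sketched as follows. This is an illustration, not the committed code: the value -1.0 for NO_NORMALIZATION, the class and method names, and the rejection of a power of exactly 0 (or of a non-finite power) are all assumptions made for the example.

```java
import java.util.Arrays;

// Hypothetical sketch of p-norm normalization with an explicit "no normalization" sentinel.
final class NormSketch {
    static final double NO_NORMALIZATION = -1.0; // sentinel value assumed for this sketch

    // Returns a copy of values divided by its p-norm, or an unchanged copy for the sentinel.
    static double[] normalize(double[] values, double power) {
        if (power == NO_NORMALIZATION) {
            return values.clone();
        }
        // Any other non-positive, NaN, or infinite power is rejected.
        if (!(power > 0) || Double.isInfinite(power)) {
            throw new IllegalArgumentException("invalid norm power: " + power);
        }
        double norm = pNorm(values, power);
        if (norm == 0.0) {
            return values.clone(); // an all-zero vector cannot be scaled
        }
        double[] normalized = new double[values.length];
        for (int i = 0; i < values.length; i++) {
            normalized[i] = values[i] / norm;
        }
        return normalized;
    }

    // p-norm for finite p > 0: (sum over i of |v_i|^p)^(1/p).
    static double pNorm(double[] values, double power) {
        double sum = 0.0;
        for (double value : values) {
            sum += Math.pow(Math.abs(value), power);
        }
        return Math.pow(sum, 1.0 / power);
    }

    public static void main(String[] args) {
        double[] counts = {3.0, 4.0};
        System.out.println(Arrays.toString(normalize(counts, 2.0)));
        System.out.println(Arrays.toString(normalize(counts, NO_NORMALIZATION)));
    }
}
```

Making the sentinel a named constant means callers who want raw counts ask for them explicitly, instead of relying on an arbitrary negative value.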


[jira] Commented: (MAHOUT-126) Prepare document vectors from the text

Posted by "David Hall (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714348#action_12714348 ] 

David Hall commented on MAHOUT-126:
-----------------------------------

I actually need something like this as well for LDA, except that I would prefer to be able to have the vectors not TF-IDF weighted. Could I get you to add some way of configuring that?

> Prepare document vectors from the text
> --------------------------------------
>
>                 Key: MAHOUT-126
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-126
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Shashikant Kore
>
> Clustering algorithms presently take the document vectors as input.  Generating these document vectors from the text can be broken in two tasks. 
> 1. Create lucene index of the input  plain-text documents 
> 2. From the index, generate the document vectors (sparse) with weights as TF-IDF values of the term. With lucene index, this value can be calculated very easily. 
> Presently, I have created two separate utilities, which could possibly be invoked from another class. 


[jira] Commented: (MAHOUT-126) Prepare document vectors from the text

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12720882#action_12720882 ] 

Grant Ingersoll commented on MAHOUT-126:
----------------------------------------

Also, I don't use git. Is there a way to produce a patch that is consumable by the patch utility, or can you provide the options needed to run it?

In SVN, I do:
{code}
svn diff > ../mypatch.patch
{code}

and then apply as:
{code}
patch -p 0 -i ../mypatch.patch
{code}



> Prepare document vectors from the text
> --------------------------------------
>
>                 Key: MAHOUT-126
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-126
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.2
>
>         Attachments: mahout-126-benson.patch, MAHOUT-126-no-normalization.patch, MAHOUT-126.patch, MAHOUT-126.patch, MAHOUT-126.patch, MAHOUT-126.patch
>
>
> Clustering algorithms presently take the document vectors as input.  Generating these document vectors from the text can be broken in two tasks. 
> 1. Create lucene index of the input  plain-text documents 
> 2. From the index, generate the document vectors (sparse) with weights as TF-IDF values of the term. With lucene index, this value can be calculated very easily. 
> Presently, I have created two separate utilities, which could possibly be invoked from another class. 


[jira] Commented: (MAHOUT-126) Prepare document vectors from the text

Posted by "David Hall (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721346#action_12721346 ] 

David Hall commented on MAHOUT-126:
-----------------------------------

That's not the only time. This constructor clearly lets certain things slip through.

{code}
  public CachedTermInfo(IndexReader reader, String field, int minDf, int maxDfPercent) throws IOException {
    this.field = field;
    TermEnum te = reader.terms(new Term(field, ""));
    int count = 0;
    int numDocs = reader.numDocs();
    double percent = numDocs * maxDfPercent / 100.0;
    //Should we use a linked hash map so that we know terms are in order?
    termEntries = new LinkedHashMap<String, TermEntry>();
    do {
      Term term = te.term();
      if (term == null || term.field().equals(field) == false){
        break;
      }
      int df = te.docFreq();
      if (df < minDf || df > percent){
        continue;
      }
      TermEntry entry = new TermEntry(term.text(), count++, df);
      termEntries.put(entry.term, entry);
    } while (te.next());
    te.close();
{code}

My code is essentially Lucene's demo indexing code (IndexFiles.java and FileDocument.java: http://google.com/codesearch/p?hl=en&sa=N&cd=1&ct=rc#uGhWbO8eR20/trunk/src/demo/org/apache/lucene/demo/FileDocument.java&q=org.apache.lucene.demo.IndexFiles
) except that I replaced
{code}doc.add(new Field("contents", new FileReader(f)));{code}

with
{code}   doc.add(new Field("contents", new FileReader(f),Field.TermVector.YES));{code}

I then ran {code} java -cp <classpath> org.apache.lucene.demo.IndexFiles /Users/dlwh/txt-reuters/ {code}

and then {code} java -cp <classpath> org.apache.mahout.utils.vectors.Driver --dir /Users/dlwh/src/lucene/index/ --output ~/src/vec-reuters -f contents -t /Users/dlwh/dict --weight TF {code}

For what it's worth, it gives a null on "reuters", which is not usually a stop word. Every single document ends with it, though, so the document-frequency filtering above (df > percent) is catching it.
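
A minimal, Lucene-free sketch of the effect described above. The class name, the sample document frequencies, and the minDf/maxDfPercent values are hypothetical and chosen only for illustration; the filter condition mirrors the constructor quoted earlier.

```java
import java.util.LinkedHashMap;
import java.util.Map;

final class DfFilterSketch {
  /** Keeps only terms with minDf <= df <= maxDfPercent percent of numDocs. */
  static Map<String, Integer> buildDictionary(Map<String, Integer> docFreqs, int numDocs,
                                              int minDf, int maxDfPercent) {
    double maxDf = numDocs * maxDfPercent / 100.0;
    Map<String, Integer> dictionary = new LinkedHashMap<>();
    int nextIndex = 0;
    for (Map.Entry<String, Integer> e : docFreqs.entrySet()) {
      int df = e.getValue();
      if (df < minDf || df > maxDf) {
        continue; // filtered out: too rare or too common
      }
      dictionary.put(e.getKey(), nextIndex++);
    }
    return dictionary;
  }

  public static void main(String[] args) {
    // Hypothetical corpus of 100 documents.
    Map<String, Integer> docFreqs = new LinkedHashMap<>();
    docFreqs.put("oil", 40);
    docFreqs.put("reuters", 100); // appears in every document
    docFreqs.put("zzz", 1);
    Map<String, Integer> dict = buildDictionary(docFreqs, 100, 2, 99);
    System.out.println(dict.containsKey("oil")); // prints true
    System.out.println(dict.get("reuters"));     // prints null
  }
}
```

The term is still present in each document's term vector, so any later lookup against the filtered dictionary has to tolerate a missing entry.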



> Prepare document vectors from the text
> --------------------------------------
>
>                 Key: MAHOUT-126
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-126
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.2
>
>         Attachments: mahout-126-benson.patch, MAHOUT-126-no-normalization.patch, MAHOUT-126-no-normalization.patch, MAHOUT-126-null-entry.patch, MAHOUT-126-TF.patch, MAHOUT-126.patch, MAHOUT-126.patch, MAHOUT-126.patch, MAHOUT-126.patch
>
>
> Clustering algorithms presently take the document vectors as input.  Generating these document vectors from the text can be broken in two tasks. 
> 1. Create lucene index of the input  plain-text documents 
> 2. From the index, generate the document vectors (sparse) with weights as TF-IDF values of the term. With lucene index, this value can be calculated very easily. 
> Presently, I have created two separate utilities, which could possibly be invoked from another class. 


[jira] Commented: (MAHOUT-126) Prepare document vectors from the text

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714693#action_12714693 ] 

Grant Ingersoll commented on MAHOUT-126:
----------------------------------------

bq. Seems like we should be able to avoid caching the whole term list in memory. At a minimum, if you are going to, allTerms should be a Map<String, Integer> that stores the term and its DF (doc freq.), as you are currently doing the DF lookup twice, AFAICT. DF lookup is expensive in Lucene. If you don't cache the whole list, we should at least have an LRU cache for DF.

Never mind, I see why the list is cached.  Still, makes sense to cache the DF.

> Prepare document vectors from the text
> --------------------------------------
>
>                 Key: MAHOUT-126
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-126
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>         Attachments: mahout-126-benson.patch, MAHOUT-126.patch
>
>
> Clustering algorithms presently take the document vectors as input.  Generating these document vectors from the text can be broken in two tasks. 
> 1. Create lucene index of the input  plain-text documents 
> 2. From the index, generate the document vectors (sparse) with weights as TF-IDF values of the term. With lucene index, this value can be calculated very easily. 
> Presently, I have created two separate utilities, which could possibly be invoked from another class. 


[jira] Updated: (MAHOUT-126) Prepare document vectors from the text

Posted by "David Hall (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Hall updated MAHOUT-126:
------------------------------

    Comment: was deleted

(was: Ok, I'm going to call this a mostly functional patch.)

> Prepare document vectors from the text
> --------------------------------------
>
>                 Key: MAHOUT-126
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-126
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.2
>
>         Attachments: mahout-126-benson.patch, MAHOUT-126-no-normalization.patch, MAHOUT-126-no-normalization.patch, MAHOUT-126-null-entry.patch, MAHOUT-126-TF.patch, MAHOUT-126.patch, MAHOUT-126.patch, MAHOUT-126.patch, MAHOUT-126.patch
>
>
> Clustering algorithms presently take the document vectors as input.  Generating these document vectors from the text can be broken in two tasks. 
> 1. Create lucene index of the input  plain-text documents 
> 2. From the index, generate the document vectors (sparse) with weights as TF-IDF values of the term. With lucene index, this value can be calculated very easily. 
> Presently, I have created two separate utilities, which could possibly be invoked from another class. 


[jira] Commented: (MAHOUT-126) Prepare document vectors from the text

Posted by "Shashikant Kore (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719454#action_12719454 ] 

Shashikant Kore commented on MAHOUT-126:
----------------------------------------

Grant, 

I went through the patch.  Compilation failed with the following error.
"Driver.java:[111,26] FSDirectory(java.io.File,org.apache.lucene.store.LockFactory) has protected access in org.apache.lucene.store.FSDirectory"

So, I haven't really run the code.

Overall, the code looks good.  Now I understand TermVectorMapper. 

Should VectorMapper be taken as an option? David had commented that he wants vectors with DF as weights. He could add, say, a DFMapper to get the desired weights.

I think document labelling (MAHOUT-65) also needs to go in soon, because it will require changes to this code.  Most of those changes will be in LuceneIteratable.


> Prepare document vectors from the text
> --------------------------------------
>
>                 Key: MAHOUT-126
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-126
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>         Attachments: mahout-126-benson.patch, MAHOUT-126.patch, MAHOUT-126.patch
>
>
> Clustering algorithms presently take the document vectors as input.  Generating these document vectors from the text can be broken in two tasks. 
> 1. Create lucene index of the input  plain-text documents 
> 2. From the index, generate the document vectors (sparse) with weights as TF-IDF values of the term. With lucene index, this value can be calculated very easily. 
> Presently, I have created two separate utilities, which could possibly be invoked from another class. 


[jira] Updated: (MAHOUT-126) Prepare document vectors from the text

Posted by "Shashikant Kore (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Shashikant Kore updated MAHOUT-126:
-----------------------------------

    Attachment: MAHOUT-126.patch

Patch to create index and document vectors from text.

> Prepare document vectors from the text
> --------------------------------------
>
>                 Key: MAHOUT-126
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-126
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Shashikant Kore
>         Attachments: MAHOUT-126.patch
>
>
> Clustering algorithms presently take the document vectors as input.  Generating these document vectors from the text can be broken in two tasks. 
> 1. Create lucene index of the input  plain-text documents 
> 2. From the index, generate the document vectors (sparse) with weights as TF-IDF values of the term. With lucene index, this value can be calculated very easily. 
> Presently, I have created two separate utilities, which could possibly be invoked from another class. 


[jira] Commented: (MAHOUT-126) Prepare document vectors from the text

Posted by "David Hall (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721068#action_12721068 ] 

David Hall commented on MAHOUT-126:
-----------------------------------

Ok, I'm probably misunderstanding something, or there could be a bug. I modified Lucene's demo indexer to store a term vector. It's still crashing. I added a series of printlns before TermVector.java:65 and CachedTermInfo:71, and I end up with the assertion here failing:

{code}
  @Override
  public TermEntry getTermEntry(String field, String term) {
    if (this.field.equals(field) == false){ return null;}
    TermEntry ret =  termEntries.get(term);
    assert(ret != null); // This assertion is firing.
    return ret;
  }
{code}

In my dataset, this happens after several hundred iterations. The term is a stop-word for the corpus in question, and it looks like there's an attempt at stopwording earlier in the file. Maybe these are not interacting well?

-- David

> Prepare document vectors from the text
> --------------------------------------
>
>                 Key: MAHOUT-126
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-126
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.2
>
>         Attachments: mahout-126-benson.patch, MAHOUT-126-no-normalization.patch, MAHOUT-126-no-normalization.patch, MAHOUT-126-TF.patch, MAHOUT-126.patch, MAHOUT-126.patch, MAHOUT-126.patch, MAHOUT-126.patch
>
>
> Clustering algorithms presently take the document vectors as input.  Generating these document vectors from the text can be broken in two tasks. 
> 1. Create lucene index of the input  plain-text documents 
> 2. From the index, generate the document vectors (sparse) with weights as TF-IDF values of the term. With lucene index, this value can be calculated very easily. 
> Presently, I have created two separate utilities, which could possibly be invoked from another class. 


[jira] Commented: (MAHOUT-126) Prepare document vectors from the text

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714509#action_12714509 ] 

Grant Ingersoll commented on MAHOUT-126:
----------------------------------------

So just kind of brainstorming here, but I think we should create a separate Module for this kind of stuff, to keep it out of core and give us some more flexibility with regard to dependencies, etc.

Also (and I realize this is just a starting patch), I think we should assume a Lucene index exists already instead of maintaining code to actually create an index.  There are a lot of ways to do that and people will likely have different fields, etc.  For instance, Solr can provide all of the capabilities here and it has distributed support, so it can scale.  Moreover, people may have the info in a DB or in other places.  I realize we need baby steps, but...

I'll try to post a patch this afternoon that takes this effort and melds it with some of my ideas for demo purposes.

> Prepare document vectors from the text
> --------------------------------------
>
>                 Key: MAHOUT-126
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-126
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>         Attachments: mahout-126-benson.patch, MAHOUT-126.patch
>
>
> Clustering algorithms presently take the document vectors as input.  Generating these document vectors from the text can be broken in two tasks. 
> 1. Create lucene index of the input  plain-text documents 
> 2. From the index, generate the document vectors (sparse) with weights as TF-IDF values of the term. With lucene index, this value can be calculated very easily. 
> Presently, I have created two separate utilities, which could possibly be invoked from another class. 


[jira] Updated: (MAHOUT-126) Prepare document vectors from the text

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated MAHOUT-126:
-----------------------------------

        Fix Version/s: 0.2
    Affects Version/s: 0.2
               Status: Patch Available  (was: Open)

> Prepare document vectors from the text
> --------------------------------------
>
>                 Key: MAHOUT-126
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-126
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.2
>
>         Attachments: mahout-126-benson.patch, MAHOUT-126.patch, MAHOUT-126.patch, MAHOUT-126.patch
>
>
> Clustering algorithms presently take the document vectors as input.  Generating these document vectors from the text can be broken in two tasks. 
> 1. Create lucene index of the input  plain-text documents 
> 2. From the index, generate the document vectors (sparse) with weights as TF-IDF values of the term. With lucene index, this value can be calculated very easily. 
> Presently, I have created two separate utilities, which could possibly be invoked from another class. 


[jira] Commented: (MAHOUT-126) Prepare document vectors from the text

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12719524#action_12719524 ] 

Grant Ingersoll commented on MAHOUT-126:
----------------------------------------

Yeah, still needs the labeling stuff.

As for weights, you should be able to pass in a Weight object.  See the TFIDF implementation.  Likely still needs some work.

As for the Lucene error, I thought I had updated the Lucene version to be 2.9-dev, which I believe makes this all right.

> Prepare document vectors from the text
> --------------------------------------
>
>                 Key: MAHOUT-126
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-126
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>         Attachments: mahout-126-benson.patch, MAHOUT-126.patch, MAHOUT-126.patch
>
>
> Clustering algorithms presently take the document vectors as input.  Generating these document vectors from the text can be broken in two tasks. 
> 1. Create lucene index of the input  plain-text documents 
> 2. From the index, generate the document vectors (sparse) with weights as TF-IDF values of the term. With lucene index, this value can be calculated very easily. 
> Presently, I have created two separate utilities, which could possibly be invoked from another class. 


[jira] Updated: (MAHOUT-126) Prepare document vectors from the text

Posted by "David Hall (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Hall updated MAHOUT-126:
------------------------------

    Attachment: MAHOUT-126-TF.patch

This patch contains an implementation of a TF weight, and it adds the --weight option to Driver to support its use. Default is TFIDF. An error is thrown on input besides TFIDF or TF.
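
For readers without the patch at hand, the behaviour described above could look roughly like the following sketch. The interface shape, the names, and the textbook TF-IDF formula are assumptions made for illustration; they are not taken from the attached patch, whose TFIDF implementation may compute the weight differently.

```java
final class WeightSketch {
  /** Hypothetical stand-in for a pluggable weight abstraction. */
  interface Weight {
    double calculate(int tf, int df, int numDocs);
  }

  /** Raw term frequency, ignoring document frequency. */
  static final Weight TF = (tf, df, numDocs) -> tf;

  /** Textbook tf * ln(N / df); assumed here, not the patch's formula. */
  static final Weight TFIDF = (tf, df, numDocs) -> tf * Math.log((double) numDocs / df);

  /** Maps a --weight style argument to a Weight; a missing value selects TFIDF. */
  static Weight parse(String name) {
    if (name == null) {
      return TFIDF; // default when the option is not given
    }
    switch (name) {
      case "TF":
        return TF;
      case "TFIDF":
        return TFIDF;
      default:
        throw new IllegalArgumentException("Unknown weight: " + name);
    }
  }
}
```

With this shape, a caller that needs unweighted counts (for example for LDA) only changes the argument and leaves the vector-building code untouched.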

> Prepare document vectors from the text
> --------------------------------------
>
>                 Key: MAHOUT-126
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-126
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.2
>
>         Attachments: mahout-126-benson.patch, MAHOUT-126-no-normalization.patch, MAHOUT-126-no-normalization.patch, MAHOUT-126-TF.patch, MAHOUT-126.patch, MAHOUT-126.patch, MAHOUT-126.patch, MAHOUT-126.patch
>
>
> Clustering algorithms presently take the document vectors as input.  Generating these document vectors from the text can be broken in two tasks. 
> 1. Create lucene index of the input  plain-text documents 
> 2. From the index, generate the document vectors (sparse) with weights as TF-IDF values of the term. With lucene index, this value can be calculated very easily. 
> Presently, I have created two separate utilities, which could possibly be invoked from another class. 


[jira] Updated: (MAHOUT-126) Prepare document vectors from the text

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated MAHOUT-126:
-----------------------------------

    Attachment: MAHOUT-126.patch

Here's a version that is brought up to trunk and adds in MAHOUT-65-name.patch to allow for labeling the vectors.

Next, I'm going to run the output through some clustering.

> Prepare document vectors from the text
> --------------------------------------
>
>                 Key: MAHOUT-126
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-126
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.2
>
>         Attachments: mahout-126-benson.patch, MAHOUT-126.patch, MAHOUT-126.patch, MAHOUT-126.patch
>
>
> Clustering algorithms presently take the document vectors as input.  Generating these document vectors from the text can be broken in two tasks. 
> 1. Create lucene index of the input  plain-text documents 
> 2. From the index, generate the document vectors (sparse) with weights as TF-IDF values of the term. With lucene index, this value can be calculated very easily. 
> Presently, I have created two separate utilities, which could possibly be invoked from another class. 


[jira] Updated: (MAHOUT-126) Prepare document vectors from the text

Posted by "Benson Margulies (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Benson Margulies updated MAHOUT-126:
------------------------------------

    Attachment: mahout-126-benson.patch

Improved patch. Allows specification of the file character set. Can be applied from inside Eclipse.

> Prepare document vectors from the text
> --------------------------------------
>
>                 Key: MAHOUT-126
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-126
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>         Attachments: mahout-126-benson.patch, MAHOUT-126.patch
>
>
> Clustering algorithms presently take the document vectors as input.  Generating these document vectors from the text can be broken in two tasks. 
> 1. Create lucene index of the input  plain-text documents 
> 2. From the index, generate the document vectors (sparse) with weights as TF-IDF values of the term. With lucene index, this value can be calculated very easily. 
> Presently, I have created two separate utilities, which could possibly be invoked from another class. 


[jira] Commented: (MAHOUT-126) Prepare document vectors from the text

Posted by "Shashikant Kore (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714356#action_12714356 ] 

Shashikant Kore commented on MAHOUT-126:
----------------------------------------

David,

Sorry, I don't have any background in LDA. Please take a look at the patch and suggest what changes are required in the DocumentVector.getDocumentVector() method. I will do the rest of the configuration changes.

> Prepare document vectors from the text
> --------------------------------------
>
>                 Key: MAHOUT-126
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-126
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Shashikant Kore
>         Attachments: MAHOUT-126.patch
>
>
> Clustering algorithms presently take the document vectors as input.  Generating these document vectors from the text can be broken in two tasks. 
> 1. Create lucene index of the input  plain-text documents 
> 2. From the index, generate the document vectors (sparse) with weights as TF-IDF values of the term. With lucene index, this value can be calculated very easily. 
> Presently, I have created two separate utilities, which could possibly be invoked from another class. 


[jira] Commented: (MAHOUT-126) Prepare document vectors from the text

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12721215#action_12721215 ] 

Grant Ingersoll commented on MAHOUT-126:
----------------------------------------

Hey David,

I'm not sure what's going on here, because that value being null means the term is not in the index, yet is in the Term Vector for that doc.  Are you sure you're loading the same field?  Can you share the indexing code?

This fix works, but I'd like to know at a deeper level what's going on.

> Prepare document vectors from the text
> --------------------------------------
>
>                 Key: MAHOUT-126
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-126
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.2
>
>         Attachments: mahout-126-benson.patch, MAHOUT-126-no-normalization.patch, MAHOUT-126-no-normalization.patch, MAHOUT-126-null-entry.patch, MAHOUT-126-TF.patch, MAHOUT-126.patch, MAHOUT-126.patch, MAHOUT-126.patch, MAHOUT-126.patch
>
>
> Clustering algorithms presently take the document vectors as input.  Generating these document vectors from the text can be broken in two tasks. 
> 1. Create lucene index of the input  plain-text documents 
> 2. From the index, generate the document vectors (sparse) with weights as TF-IDF values of the term. With lucene index, this value can be calculated very easily. 
> Presently, I have created two separate utilities, which could possibly be invoked from another class. 


[jira] Updated: (MAHOUT-126) Prepare document vectors from the text

Posted by "David Hall (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

David Hall updated MAHOUT-126:
------------------------------

    Attachment: MAHOUT-126-no-normalization.patch

My bad. git-format-patch formats an email that has a patch (sigh) and not a patch itself. 

Run the command you pasted on the new patch.

> Prepare document vectors from the text
> --------------------------------------
>
>                 Key: MAHOUT-126
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-126
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.2
>
>         Attachments: mahout-126-benson.patch, MAHOUT-126-no-normalization.patch, MAHOUT-126-no-normalization.patch, MAHOUT-126.patch, MAHOUT-126.patch, MAHOUT-126.patch, MAHOUT-126.patch
>
>
> Clustering algorithms presently take the document vectors as input.  Generating these document vectors from the text can be broken in two tasks. 
> 1. Create lucene index of the input  plain-text documents 
> 2. From the index, generate the document vectors (sparse) with weights as TF-IDF values of the term. With lucene index, this value can be calculated very easily. 
> Presently, I have created two separate utilities, which could possibly be invoked from another class. 


[jira] Commented: (MAHOUT-126) Prepare document vectors from the text

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12717742#action_12717742 ] 

Grant Ingersoll commented on MAHOUT-126:
----------------------------------------

Note, I haven't actually tried clustering just yet with the output!

> Prepare document vectors from the text
> --------------------------------------
>
>                 Key: MAHOUT-126
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-126
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>         Attachments: mahout-126-benson.patch, MAHOUT-126.patch, MAHOUT-126.patch
>
>
> Clustering algorithms presently take the document vectors as input.  Generating these document vectors from the text can be broken in two tasks. 
> 1. Create lucene index of the input  plain-text documents 
> 2. From the index, generate the document vectors (sparse) with weights as TF-IDF values of the term. With lucene index, this value can be calculated very easily. 
> Presently, I have created two separate utilities, which could possibly be invoked from another class. 


[jira] Commented: (MAHOUT-126) Prepare document vectors from the text

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

    [ https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=12714515#action_12714515 ] 

Grant Ingersoll commented on MAHOUT-126:
----------------------------------------

Shashikant,

A couple of comments on the Lucene-specific stuff, so that you guys can speed up what you have.

First off, have a look at Lucene's support of TermVectorMapper.  Much like SAX, it gives you a callback mechanism so that you don't have to construct two different data structures (i.e. many people incorrectly use the DOM to parse XML and then extract out of the DOM into their own data structure when they should use SAX instead).
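
A minimal sketch of the callback idea, using only the JDK. The TermCallback interface, walkTerms, and toSparseVector are hypothetical stand-ins for illustration and are not Lucene's API or the code in the patch; they only show how a callback lets the consumer write each term straight into its own structure instead of copying out of an intermediate one.

```java
import java.util.HashMap;
import java.util.Map;

final class CallbackVectorSketch {

    /** Hypothetical callback, standing in for the role a term vector mapper plays. */
    interface TermCallback {
        void onTerm(String term, int frequency);
    }

    /** Hypothetical producer: streams (term, frequency) pairs to the callback one at a time. */
    static void walkTerms(String[] terms, int[] freqs, TermCallback callback) {
        for (int i = 0; i < terms.length; i++) {
            callback.onTerm(terms[i], freqs[i]);
        }
    }

    /**
     * Consumer: each term is written directly into the sparse vector (term id -> weight)
     * as it arrives, so no second, intermediate term list is built.
     * Terms without an id in termIds are skipped.
     */
    static Map<Integer, Double> toSparseVector(String[] terms, int[] freqs,
                                               Map<String, Integer> termIds) {
        Map<Integer, Double> vector = new HashMap<>();
        walkTerms(terms, freqs, (term, freq) -> {
            Integer id = termIds.get(term);
            if (id != null) {
                vector.put(id, (double) freq);
            }
        });
        return vector;
    }
}
```

Here the weight is just the raw term frequency; a TF-IDF weight would be computed inside the callback in the same place.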

You might have a look at the TermVectorComponent in Solr, as it pretty much does what you are looking to do in this patch and I believe it to be more efficient.

Seems like we should be able to avoid caching the whole term list in memory.  At a minimum, if you are going to, allTerms should be a Map<String, Integer> that stores the term and its DF (doc freq.), as you are currently doing the DF lookup twice, AFAICT.  DF lookup is expensive in Lucene.  If you don't cache the whole list, we should at least have an LRU cache for DF.
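
One possible shape for such an LRU cache, sketched with a JDK LinkedHashMap in access order. The class name DfLruCache and the lookup function passed in are assumptions for illustration; the lookup stands in for whatever expensive DF call the index provides and is not taken from the patch.

```java
import java.util.LinkedHashMap;
import java.util.Map;
import java.util.function.ToIntFunction;

/** Bounded LRU cache of term -> document frequency (hypothetical helper, not from the patch). */
final class DfLruCache {
    private final ToIntFunction<String> lookup; // stand-in for the expensive index DF lookup
    private final Map<String, Integer> cache;
    private int misses = 0;

    DfLruCache(final int capacity, ToIntFunction<String> lookup) {
        this.lookup = lookup;
        // accessOrder = true: iteration order runs from least to most recently used.
        this.cache = new LinkedHashMap<String, Integer>(16, 0.75f, true) {
            @Override
            protected boolean removeEldestEntry(Map.Entry<String, Integer> eldest) {
                // Evict the least recently used entry once the capacity is exceeded.
                return size() > capacity;
            }
        };
    }

    /** Returns the DF for a term, calling the underlying lookup only on a cache miss. */
    int docFreq(String term) {
        Integer df = cache.get(term);
        if (df == null) {
            df = lookup.applyAsInt(term);
            misses++;
            cache.put(term, df);
        }
        return df;
    }

    /** Number of times the underlying lookup was actually invoked. */
    int misses() {
        return misses;
    }
}
```

With this, computing a TF-IDF weight would call docFreq(term) once per term occurrence, and repeated terms hit the cache rather than the index. The capacity is a tuning knob; a sensible value depends on vocabulary size and available memory.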

> Prepare document vectors from the text
> --------------------------------------
>
>                 Key: MAHOUT-126
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-126
>             Project: Mahout
>          Issue Type: New Feature
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>         Attachments: mahout-126-benson.patch, MAHOUT-126.patch


[jira] Updated: (MAHOUT-126) Prepare document vectors from the text

Posted by "Grant Ingersoll (JIRA)" <ji...@apache.org>.

     [ https://issues.apache.org/jira/browse/MAHOUT-126?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Grant Ingersoll updated MAHOUT-126:
-----------------------------------

    Resolution: Fixed
        Status: Resolved  (was: Patch Available)

I think this is in pretty good shape for now; we can open new issues to deal with specific problems.

> Prepare document vectors from the text
> --------------------------------------
>
>                 Key: MAHOUT-126
>                 URL: https://issues.apache.org/jira/browse/MAHOUT-126
>             Project: Mahout
>          Issue Type: New Feature
>    Affects Versions: 0.2
>            Reporter: Shashikant Kore
>            Assignee: Grant Ingersoll
>             Fix For: 0.2
>
>         Attachments: mahout-126-benson.patch, MAHOUT-126-no-normalization.patch, MAHOUT-126-no-normalization.patch, MAHOUT-126-null-entry.patch, MAHOUT-126-TF.patch, MAHOUT-126.patch, MAHOUT-126.patch, MAHOUT-126.patch, MAHOUT-126.patch
