Posted to dev@lucene.apache.org by "Joel Bernstein (JIRA)" <ji...@apache.org> on 2016/07/07 15:40:11 UTC

[jira] [Commented] (SOLR-9252) Feature selection and logistic regression on text

    [ https://issues.apache.org/jira/browse/SOLR-9252?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15366283#comment-15366283 ] 

Joel Bernstein commented on SOLR-9252:
--------------------------------------

[~caomanhdat], I have the patch applied and have begun the review.

I've started with the FeaturesSelectionStream and IGainTermsQParserPlugin. I'll need more time and some collaboration to review the math. But I can say now that the mechanics of feature selection look very good. The use of the streaming framework and analytics query is really nice.

One thing we'll want to put some thought into is how the features can be stored and retrieved.

Currently it looks like there is a tuple for each term/score pair. I think this works well with the update() function sending the tuples to another collection for storage. A few minor things to consider (a rough sketch combining them follows the list):

1) Should we use a field type suffix (term_s, score_f) to ensure that fields are indexed properly in the other collection?

2) We'll need to add some kind of feature set ID so the feature set can be retrieved later. Each tuple will then be tagged with the feature set ID. Possibly adding a *featureSet* parameter to the stream makes sense for this.
 
3) We can also add a unique ID which will be used as the unique ID for each tuple in the index. We could concatenate the term with the feature set ID to make the unique ID.
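To make this concrete, here is a rough sketch of what the storage pipeline could look like, assuming the proposed *featureSet* parameter and a hypothetical destination collection named featureStore (neither exists in the current patch):

{code}
update(featureStore,
       batchSize=100,
       featuresSelection(collection1,
                         q="*:*",
                         field="tv_text",
                         outcome="out_i",
                         positiveLabel=1,
                         numTerms=100,
                         featureSet="spamModel1"))
{code}

Each stored tuple would then carry term_s, score_f, a featureSet_s tag, and a unique id built by concatenating the feature set ID with the term (e.g. spamModel1_money). Retrieving a feature set later would then just be a query on featureSet_s against the featureStore collection.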







> Feature selection and logistic regression on text
> -------------------------------------------------
>
>                 Key: SOLR-9252
>                 URL: https://issues.apache.org/jira/browse/SOLR-9252
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Cao Manh Dat
>            Assignee: Joel Bernstein
>         Attachments: SOLR-9252.patch, SOLR-9252.patch, enron1.zip
>
>
> SOLR-9186 came up with the challenge that for each iteration we have to rebuild the tf-idf vector for each document. This computation is costly when a document is represented by a large number of terms. Feature selection can help reduce the computation.
> Due to its computational efficiency and simple interpretation, information gain is one of the most popular feature selection methods. It measures the dependence between features and labels by calculating the information gain between the i-th feature and the class labels (http://www.jiliang.xyz/publication/feature_selection_for_classification.pdf).
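> For reference, a standard way to write the per-term information gain used in this kind of feature selection (my notation, not taken from the patch), where t is a term, \bar{t} its absence, and c ranges over the class labels:
> {code}
> IG(t) = -\sum_c P(c) \log P(c)
>         + P(t) \sum_c P(c \mid t) \log P(c \mid t)
>         + P(\bar{t}) \sum_c P(c \mid \bar{t}) \log P(c \mid \bar{t})
> {code}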
> I confirmed this by running logistic regression on the Enron mail dataset (in which each email is represented by the top 100 terms with the highest information gain) and got 92% accuracy and 82% precision.
> This ticket will create two new streaming expressions. Both of them use the same *parallel iterative framework* as SOLR-8492.
> {code}
> featuresSelection(collection1, q="*:*",  field="tv_text", outcome="out_i", positiveLabel=1, numTerms=100)
> {code}
> featuresSelection will emit the top terms with the highest information gain scores. It can be combined with the new tlogit stream, as sketched below.
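> Each emitted tuple would be a simple term/score pair, along these lines (illustrative field names, not necessarily those used in the patch):
> {code}
> {"term_s": "money", "score_f": 0.37}
> {code}
> Combined with tlogit: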
> {code}
> tlogit(collection1, q="*:*",
>          featuresSelection(collection1, 
>                                       q="*:*",  
>                                       field="tv_text", 
>                                       outcome="out_i", 
>                                       positiveLabel=1, 
>                                       numTerms=100),
>          field="tv_text",
>          outcome="out_i",
>          maxIterations=100)
> {code}
> In iteration n, the text logistic regression will emit the nth model and compute the error of the (n-1)th model; the error would be wrong if we computed it dynamically within the same iteration, because the model is still changing as that iteration proceeds.
> In each iteration tlogit will adjust the learning rate based on the error of the previous iteration: it will increase the learning rate by 5% if the error is going down and decrease it by 50% if the error is going up.
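> In other words (my notation, not from the patch), with \eta_n the learning rate and err_n the error observed for iteration n:
> {code}
> \eta_{n+1} =
>   \begin{cases}
>     1.05\,\eta_n & \text{if } err_n < err_{n-1} \\
>     0.50\,\eta_n & \text{otherwise}
>   \end{cases}
> {code}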
> This will support use cases such as building models for spam detection, sentiment analysis and threat detection. 



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)

---------------------------------------------------------------------
To unsubscribe, e-mail: dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: dev-help@lucene.apache.org