Posted to dev@lucene.apache.org by "Joel Bernstein (JIRA)" <ji...@apache.org> on 2016/11/30 15:18:00 UTC

[jira] [Resolved] (SOLR-9252) Feature selection and logistic regression on text

     [ https://issues.apache.org/jira/browse/SOLR-9252?page=com.atlassian.jira.plugin.system.issuetabpanels:all-tabpanel ]

Joel Bernstein resolved SOLR-9252.
----------------------------------
    Resolution: Resolved

> Feature selection and logistic regression on text
> -------------------------------------------------
>
>                 Key: SOLR-9252
>                 URL: https://issues.apache.org/jira/browse/SOLR-9252
>             Project: Solr
>          Issue Type: Improvement
>      Security Level: Public(Default Security Level. Issues are Public) 
>          Components: search, SolrCloud, SolrJ
>            Reporter: Cao Manh Dat
>            Assignee: Joel Bernstein
>              Labels: Streaming
>             Fix For: 6.2
>
>         Attachments: SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, SOLR-9252.patch, SOLR-9299-1.patch
>
>
> This ticket adds two new streaming expressions: *features* and *train*.
> These two functions work together to train a logistic regression model on text, from a training set stored in a SolrCloud collection.
> The syntax is as follows:
> {code}
> train(collection1, q="*:*",
>       features(collection1, 
>                q="*:*",  
>                field="body", 
>                outcome="out_i", 
>                positiveLabel=1, 
>                numTerms=100),
>       field="body",
>       outcome="out_i",
>       maxIterations=100)
> {code}
> The *features* function extracts feature terms from a training set, using *information gain* to score the terms. See http://www.jiliang.xyz/publication/feature_selection_for_classification.pdf
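To make the scoring idea concrete, here is a minimal Python sketch of information gain for a binary outcome. It is an illustration only, not the code in the patch: the function names, the toy documents, and the choice to treat a term as simply present or absent in a document are all assumptions.

```python
import math

def entropy(pos, neg):
    """Binary entropy (in bits) of a set holding `pos` positive and `neg` negative docs."""
    total = pos + neg
    h = 0.0
    for count in (pos, neg):
        if count:
            p = count / total
            h -= p * math.log2(p)
    return h

def information_gain(docs, labels, term):
    """Label entropy minus the expected entropy after splitting on presence of `term`."""
    with_term = [y for d, y in zip(docs, labels) if term in d]
    without_term = [y for d, y in zip(docs, labels) if term not in d]
    n = len(labels)

    def h(ys):
        return entropy(sum(ys), len(ys) - sum(ys))

    return (h(labels)
            - (len(with_term) / n) * h(with_term)
            - (len(without_term) / n) * h(without_term))

# Hypothetical toy training set: each doc is a set of terms, label 1 = positive outcome.
docs = [{"viagra", "click"}, {"viagra", "free"}, {"meter", "gas"}, {"gas", "click"}]
labels = [1, 1, 0, 0]

scores = {t: information_gain(docs, labels, t) for t in ["viagra", "click", "gas", "free"]}
# Keep the two best-scoring terms, analogous in spirit to numTerms in the expression above.
top_terms = sorted(scores, key=scores.get, reverse=True)[:2]
```

In this toy set "viagra" and "gas" each split the labels perfectly and score highest, while "click" appears equally in both classes and scores zero.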
> The *train* function uses the extracted features to train a logistic regression model on a text field in the training set.
> For both *features* and *train*, the training set is defined by a query. The doc vectors in the *train* function use tf-idf to represent the terms in the document. The idf is calculated for the specific training set, allowing multiple training sets to be stored in the same collection without polluting the idf.
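The per-training-set idf can be sketched as follows. This is a hedged illustration in Python, not Solr's implementation: the idf formula `log(N / df)`, the raw term count used as tf, and all names are assumptions made for the example.

```python
import math
from collections import Counter

def training_set_idfs(docs, terms):
    """idf for each feature term, computed only over `docs` (the query-defined training set)."""
    n = len(docs)
    idfs = {}
    for term in terms:
        df = sum(1 for d in docs if term in d)
        # Assumed formula: log(N / df); a term absent from the set gets 0.0.
        idfs[term] = math.log(n / df) if df else 0.0
    return idfs

def doc_vector(tokens, terms, idfs):
    """tf-idf vector for one document, restricted to the extracted feature terms."""
    tf = Counter(tokens)
    return [tf[t] * idfs[t] for t in terms]

# Hypothetical training set: only these four docs contribute to document frequencies.
training_docs = [["gas", "meter", "gas"], ["viagra", "click"], ["gas", "click"], ["meter"]]
terms = ["gas", "viagra"]

idfs = training_set_idfs(training_docs, terms)
vec = doc_vector(training_docs[0], terms, idfs)
```

Because the document frequencies come only from the documents matched by the training query, a second training set in the same collection would get its own idf values.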
> In the *train* function a batch gradient descent approach is used to iteratively train the model.
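As a rough picture of batch gradient descent for logistic regression, the sketch below accumulates the gradient over the whole training set before each weight update. It is a single-process Python illustration under stated assumptions: the learning rate `alpha`, the trailing bias weight, the zero initialization, and the toy vectors are not taken from the patch.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def predict(weights, x):
    """Probability of the positive outcome; the last weight is an assumed bias term."""
    xb = list(x) + [1.0]
    return sigmoid(sum(w * v for w, v in zip(weights, xb)))

def train_logistic(vectors, outcomes, alpha=0.1, max_iterations=100):
    """Batch gradient descent: one weight update per full pass over the training set."""
    weights = [0.0] * (len(vectors[0]) + 1)
    for _ in range(max_iterations):
        gradient = [0.0] * len(weights)
        for x, y in zip(vectors, outcomes):
            error = predict(weights, x) - y
            for j, v in enumerate(list(x) + [1.0]):
                gradient[j] += error * v
        # Apply the accumulated gradient once per iteration.
        weights = [w - alpha * g for w, g in zip(weights, gradient)]
    return weights

# Hypothetical tf-idf vectors over two feature terms, with 0/1 outcomes.
vectors = [[2.0, 0.0], [1.5, 0.0], [0.0, 1.0], [0.0, 2.0]]
outcomes = [0, 0, 1, 1]

weights = train_logistic(vectors, outcomes, max_iterations=100)
```

After training on this separable toy set, `predict(weights, [0.0, 2.0])` is above 0.5 and `predict(weights, [2.0, 0.0])` is below it.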
> Both the *features* and the *train* functions are embedded in Solr using the AnalyticsQuery framework, so only the model is transported across the network with each iteration.
> Both the features and the models can be stored in a SolrCloud collection. Using this approach, Solr can hold millions of models, which can be selectively deployed. For example, a model could be trained for each user to personalize ranking and recommendations.
> Below is the final iteration of a model trained on the Enron Ham/Spam dataset. The model includes the terms and their idfs and weights, as well as a classification evaluation describing the accuracy of the model on the training set.
> {code}
> {
> 			"idfs_ds": [1.2627703388716238, 1.2043595767152093, 1.3886172425360304, 1.5488587854881268, 1.6127302558747882, 2.1359177807201526, 1.514866246141212, 1.7375701403808523, 1.6166175299631897, 1.756428159015249, 1.7929202354640175, 1.2834893120635762, 1.899442866302021, 1.8639061320252337, 1.7631697575821685, 1.6820002892260415, 1.4411352768194767, 2.103708877350535, 1.2225773869965861, 2.208893321170597, 1.878981794430681, 2.043737027506736, 2.2819184561854864, 2.3264563106163885, 1.9336117619172708, 2.0467265663551024, 1.7386696457142692, 2.468795829515302, 2.069437610615317, 2.6294363202479327, 3.7388303845193307, 2.5446615802900157, 1.7430797961918219, 3.0787440662202736, 1.9579702057493114, 2.289523055570706, 1.5362003886162032, 2.7549569891263763, 3.955894889757158, 2.587435396273302, 3.945844553903657, 1.003513057076781, 3.0416264032637708, 2.248395764146843, 4.018415246738492, 2.2876164773001246, 3.3636289340509933, 1.2438124251270097, 2.733903579928544, 3.439026951535205, 0.6709665389201712, 0.9546224358275518, 2.8080115520822657, 2.477970205791343, 2.2631561797299637, 3.2378087608499606, 0.36177021415584676, 4.1083634834014315, 4.120197941048435, 2.471081544796158, 2.4241455557775633, 2.923393626201111, 2.9269972337044097, 3.2987413118451183, 2.383498249003407, 4.168988105217867, 2.877691472720256, 4.233526626355437, 3.8505343740993316, 2.3264563106163885, 2.6429318017228174, 4.260555298743357, 3.0058372954121855, 3.8688835127675283, 3.021585652380325, 3.0295538220295017, 1.9620882623582288, 3.469610374907285, 3.945844553903657, 3.4821105376715167, 4.3169082352944885, 2.520329479630485, 3.609372317282444, 3.070375816549757, 4.220281399605417, 3.9866665484239117, 3.6165408067610563, 3.788840805093992, 4.392131656532076, 4.392131656532076, 2.837281934382379, 3.698984475972131, 4.331507034715641, 2.360699334038601, 2.7368842080666815, 3.730733174286711, 3.1991566064156816, 4.4238803548466565, 2.4665153268165767, 3.175736332207583, 3.2378087608499606, 
4.376627469996111, 3.3525177086259226, 3.28315658082842, 4.156565585219309, 1.6462639699299098, 2.673278958112109, 4.331507034715641, 3.955894889757158, 2.7764631943473397, 3.0497565293470212, 1.79060004880832, 3.6237610547345436, 1.6244377066690232, 2.948895919012047, 3.175736332207583, 2.850571166501062, 4.073677925413541, 2.725014632511298, 3.1573871935393867, 4.562030693327474, 3.5403794457954922, 4.580722826339627, 4.580722826339627, 3.189722574182323, 3.1665196771026594, 3.3306589148134234, 1.9745451708435238, 3.3306589148134234, 2.795272526304836, 3.3415285870503273, 4.407880013500216, 4.4238803548466565, 2.6902285164258823, 3.668212817305377, 4.543681554659277, 2.559550192783766, 1.5452257206382456, 2.2631561797299637, 4.659194441781121, 3.2678110111537597, 3.878185905429842, 3.3525177086259226, 3.374865007317919, 3.780330115426083, 4.376627469996111, 3.433020927474993, 3.6758174166905966, 4.288334862850433, 3.2378087608499606, 4.490571729345329, 2.9269972337044097, 4.029226162842708, 3.0538465145985465, 4.440140875718437, 3.533734903076824, 4.659194441781121, 4.659194441781121, 4.525663049156599, 3.706827653433157, 3.1172927363375087, 4.490571729345329, 2.552078177945065, 2.087985282971078, 4.83744267318744, 4.562030693327474, 4.09666744363824, 4.659194441781121, 1.802255192400069, 4.599771021310321, 3.788840805093992, 4.8621352857778115, 4.6798137289838575, 4.376627469996111, 3.272900080661231, 3.8970543897342247, 4.638991734463602, 4.638991734463602, 4.813345121608379, 4.813345121608379, 4.8621352857778115, 4.83744267318744, 3.588170109631841, 4.13217413209515, 4.599771021310321, 4.331507034715641, 3.134914337687328, 4.525663049156599, 4.722373343402653, 3.955894889757158, 4.967495801435638, 4.580722826339627, 4.967495801435638, 4.9134285801653625, 4.887453093762102, 4.407880013500216, 4.246949646687578, 2.198385343572182, 1.5963758750107606, 4.007719957621744],
> 			"alpha_d": 7.150861416624748E-4,
> 			"terms_ss": ["enron", "2000", "cc", "hpl", "daren", "http", "gas", "forwarded", "pm", "ect", "hou", "thanks", "meter", "2001", "attached", "deal", "am", "farmer", "your", "nom", "corp", "more", "mmbtu", "xls", "here", "j", "let", "volumes", "questions", "www", "2004", "sitara", "no", "money", "01", "volume", "know", "best", "meds", "bob", "prescription", "please", "online", "file", "viagra", "02", "stop", "me", "nomination", "v", "on", "i", "click", "texas", "03", "prices", "for", "paliourg", "php", "09", "contract", "fyi", "actuals", "u", "04", "pain", "713", "drugs", "microsoft", "email", "robert", "cialis", "melissa", "investment", "teco", "pat", "11", "save", "professional", "world", "biz", "flow", "dollars", "noms", "2005", "act", "remove", "results", "soft", "xp", "mary", "80", "spam", "following", "06", "software", "n", "dealer", "08", "ena", "offer", "sex", "products", "special", "compliance", "see", "free", "cheap", "html", "07", "gary", "000", "low", "our", "houston", "many", "april", "size", "r", "tap", "lots", "product", "pills", "xanax", "vance", "ami", "chokshi", "12", "clynes", "ticket", "counterparty", "super", "thousand", "daily", "offers", "weight", "05", "all", "call", "photoshop", "julie", "stock", "lisa", "steve", "million", "health", "site", "quality", "stocks", "link", "featured", "net", "international", "most", "investing", "works", "readers", "uncertainties", "differ", "news", "david", "seek", "31", "only", "1933", "creative", "windows", "subscribers", "should", "adobe", "security", "1934", "valium", "brand", "visit", "action", "canon", "pharmacy", "sexual", "inherent", "construed", "assumptions", "internet", "mobile", "risks", "wide", "smith", "ex", "pill", "states", "projections", "medications", "predictions", "anticipates", "deciding", "events", "advice", "now", "com", "browser"],
> 			"iteration_i": 100,
> 			"weights_ds": [0.9524452699893067, -2.9257423290160225, -2.122240862520573, -0.40259380863176036, -1.242508927269482, -2.1933952666745924, 0.9119553386109202, -1.3359582128074137, -1.1717690853817335, -0.9029380383621088, -1.970576222154978, -0.9180539343040344, -2.031736167842155, -1.382820037232718, -1.4296530557007743, -1.5015080966872794, -0.852373483913152, -0.2883706803921614, -0.2366741375717678, 0.2966401203916763, -0.6792566685980972, -0.18912751254722837, 0.10265566994945839, -1.0065678789783332, -0.8967357570889625, 0.041722607774742765, -0.2832721589409925, -0.400560390908784, -0.6945385025086017, -0.8488391208665993, -0.31851465800191403, 1.570768257518063, -1.5144615060332418, 0.9411280928801138, 0.738478999511349, -0.6875177906594712, -0.47841730767672286, -0.20502227184813, 0.4858041557455349, 1.389551367014946, -0.8886199496843126, 0.8029699876855549, -0.7760217032166719, 0.40175437931353053, -0.6231018791954438, 1.0261571991645586, -0.44254206613371744, 0.31955072203529183, -0.24171600421157927, -0.632533557090375, 0.774533771979748, -1.1164595912116915, -0.2954704188664946, 0.27653823698423186, -1.157867306631878, -5.49332153268076E-5, 0.6916900118076985, -1.305726586870522, 1.370623007467874, 1.1100575515185573, 0.40953153124448194, -0.4273267120664356, -0.5536271317082946, -0.03575915648164506, 0.20475308352558616, -0.2919021960690356, 1.1094392826383312, -1.24904822249928, 1.038764158800864, 0.10525284214114823, 0.1973739189626828, -0.33283870614700184, 1.0555375704790861, 0.25856879498650104, 0.921918816504445, -0.15711181528461088, -0.3594966291171786, -0.6659758614594922, -0.3342439009175488, 0.3592708173532555, 0.12872616265365205, 1.362140022970902, -0.2699930594417464, 0.7449118829650243, -0.12665949567352622, 1.1289376146405283, 0.1653713075673579, 0.7008424353370497, 0.47095485852014707, 1.021689093687625, 1.0049928692400525, -0.18114402652386635, 0.4403400905532737, 1.0570966104647033, -1.167541821576636, -0.4428853975686944, 
0.20694894484760668, 0.15472835818468766, 1.0009582999260647, 0.013730849275970687, -0.3882888402977611, 0.14102499499877702, 1.1560852477692065, -0.822855520787489, -0.1468595831916683, 0.9069870716505091, -0.18884872126960675, -0.19213990843838719, -0.0032534107278622496, 0.2715800337813452, 0.0888346122807297, -0.37031213468904256, -0.07224227291981163, 0.08850381657180348, 0.20501283264716516, -0.5852130122059844, 0.11807896760332989, -1.3196626232666966, 0.5324969558412787, 0.7667504164777665, 0.11805357030082002, 1.0020954114301253, -0.10885082229805468, 1.003094962524753, 1.0000914796917044, 0.0094959191513861, -0.5127276009526891, 0.059129413669497796, -0.49311249434449955, 0.34652229330274653, -0.7618731785587705, -0.3514318991274448, 0.7742232232987654, 0.7575763908124484, -0.25192129997930635, -0.24220187762559128, 1.0014232005812307, -0.3453736248293833, -0.1121687186012911, -0.15547543099631278, 1.0840890597241875, -0.2879034857435273, -0.227656977034567, -0.3716602841157388, 0.18007113168986144, 0.8297688092273079, 1.405797209837956, 0.3921445898278919, 1.079363745455813, -0.6253022693091732, 0.33155358331572704, 0.9644709831096733, -0.19686285814583682, 1.1069098903214452, -0.19597970694899214, -0.29329229099344734, -0.037185151648282316, 1.0010206696926418, 1.0096586146138415, 0.9523090849946898, 0.34253175617551923, -0.41826608329006, 0.7213729935258942, -0.47416007242000024, 0.3210039942978008, 1.0, 0.9772041721907345, 0.2533596337281238, 0.9839657417973666, -0.7583308570783015, 0.9476391050914625, 0.2534925274818649, 1.0, 1.0001125385832383, 0.37796474985487505, 0.3839828352290301, 0.44224405246124543, 1.046072941713049, 1.1205405856642119, 0.9165436674154628, 0.9586701268580604, 1.0000000000000968, 0.9860828147022696, -0.32499900116244823, 1.1624049652694368, 0.4966278258894532, -0.14840111822378488, 0.15131204240736265, 1.114787005544689, 1.1782663102351227, 0.21291210471466848, 1.0000000000385034, 0.9564718923455356, 1.0110628413440756, 
1.000156375636503, 0.9763045864950046, 0.2630059727829917, 0.24199402427272665, 0.2736018381908099, -0.7673296746900424, -0.1899398724099395],
> 			"field_s": "body",
> 			"trueNegative_i": 3570,
> 			"falseNegative_i": 35,
> 			"falsePositive_i": 75,
> 			"error_d": 176.8112932306374,
> 			"truePositive_i": 1381,
> 			"id": "model_100"
> 		}
> {code}
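The four confusion-matrix counts in the model above are enough to derive the usual summary metrics. The short Python sketch below does that arithmetic; the field names and counts are copied from the model document, while the variable names and the choice of metrics are illustrative.

```python
# Counts copied from the model document above.
model = {
    "truePositive_i": 1381,
    "trueNegative_i": 3570,
    "falsePositive_i": 75,
    "falseNegative_i": 35,
}

tp = model["truePositive_i"]
tn = model["trueNegative_i"]
fp = model["falsePositive_i"]
fn = model["falseNegative_i"]

total = tp + tn + fp + fn          # 5061 training documents
accuracy = (tp + tn) / total       # share of documents classified correctly
precision = tp / (tp + fp)         # share of predicted positives that are positive
recall = tp / (tp + fn)            # share of actual positives that were found

print(f"accuracy={accuracy:.3f} precision={precision:.3f} recall={recall:.3f}")
```

On these counts the model classifies roughly 97.8% of the training documents correctly, with precision near 0.948 and recall near 0.975. These figures describe the training set only, as the description notes.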


