Posted to dev@lucene.apache.org by "Doug Turnbull (JIRA)" <ji...@apache.org> on 2016/10/20 14:14:58 UTC

[jira] [Commented] (SOLR-9418) Probabilistic-Query-Parser RequestHandler

    [ https://issues.apache.org/jira/browse/SOLR-9418?page=com.atlassian.jira.plugin.system.issuetabpanels:comment-tabpanel&focusedCommentId=15591921#comment-15591921 ] 

Doug Turnbull commented on SOLR-9418:
-------------------------------------

I've been looking at your patch (I'm not a committer, just curious about it). A few things jump out on a shallow reading that would probably need to change for this to be accepted:

- Field names and thresholds likely need to be configurable, as most folks won't necessarily have fields named exactly "title" or "content."
- Can this be a qparser plugin instead of a request handler? It's likely I'd want to use it alongside other qparsers and SearchComponents (like highlighting or facets).
- Can you provide some documentation on how the thresholds work/can be configured?

> Probabilistic-Query-Parser RequestHandler
> -----------------------------------------
>
>                 Key: SOLR-9418
>                 URL: https://issues.apache.org/jira/browse/SOLR-9418
>             Project: Solr
>          Issue Type: New Feature
>      Security Level: Public(Default Security Level. Issues are Public) 
>            Reporter: Akash Mehta
>         Attachments: SOLR-9418.zip
>
>
> The main aim of this requestHandler is to get the best parsing for a given query. This basically means recognizing different phrases within the query. We need some kind of training data to generate these phrases. The way this project works is:
> 1.) Generate all possible parsings for the given query.
> 2.) For each possible parsing, a naive-Bayes-like score is calculated.
> 3.) The main scoring is done by going through all the documents in the training set and finding the probability of a bunch of words occurring together as a phrase, as compared to them occurring randomly in the same document. Then the score is normalized. Somewhat higher importance is given to the title field than to the content field; this weighting is configurable.
> 4.) Finally, after scoring each of the possible parsings, the one with the highest score is returned.
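
For readers who have not opened the attachment, the four steps in the issue description can be sketched roughly as below. This is not the code from SOLR-9418.zip. The class name PhraseSegmenter, the precomputed phraseProb map, and the smoothing floor for unseen phrases are all assumptions made for illustration. The title/content weighting and the score normalization from step 3 are left out, and a plain sum of log probabilities stands in for the "naive-Bayes-like" score.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

// Hypothetical sketch: enumerate every phrase segmentation of a query and
// keep the one with the highest naive-Bayes-style score.
public class PhraseSegmenter {

    // Step 1: every way to split the tokens into contiguous phrases.
    // A query of n tokens has 2^(n-1) such parsings.
    public static List<List<String>> allParsings(List<String> tokens) {
        List<List<String>> out = new ArrayList<>();
        if (tokens.isEmpty()) {
            out.add(new ArrayList<>());
            return out;
        }
        for (int i = 1; i <= tokens.size(); i++) {
            String head = String.join(" ", tokens.subList(0, i));
            for (List<String> rest : allParsings(tokens.subList(i, tokens.size()))) {
                List<String> parsing = new ArrayList<>();
                parsing.add(head);
                parsing.addAll(rest);
                out.add(parsing);
            }
        }
        return out;
    }

    // Steps 2-3 (simplified): sum of log probabilities of the phrases.
    // phraseProb is assumed to be estimated from training documents elsewhere;
    // phrases missing from it get the (assumed, positive) floor probability.
    public static double score(List<String> parsing, Map<String, Double> phraseProb, double floor) {
        double logScore = 0.0;
        for (String phrase : parsing) {
            logScore += Math.log(phraseProb.getOrDefault(phrase, floor));
        }
        return logScore;
    }

    // Step 4: return the highest-scoring parsing.
    public static List<String> best(List<String> tokens, Map<String, Double> phraseProb, double floor) {
        List<String> bestParsing = null;
        double bestScore = Double.NEGATIVE_INFINITY;
        for (List<String> parsing : allParsings(tokens)) {
            double s = score(parsing, phraseProb, floor);
            if (s > bestScore) {
                bestScore = s;
                bestParsing = parsing;
            }
        }
        return bestParsing;
    }

    public static void main(String[] args) {
        // Made-up probabilities, purely for demonstration.
        Map<String, Double> probs = new HashMap<>();
        probs.put("new", 0.05);
        probs.put("york", 0.01);
        probs.put("pizza", 0.02);
        probs.put("new york", 0.04);
        // prints [new york, pizza]
        System.out.println(best(Arrays.asList("new", "york", "pizza"), probs, 1e-6));
    }
}
```

Because the number of parsings doubles with each extra token, exhaustive enumeration like this is only practical for short queries.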



--
This message was sent by Atlassian JIRA
(v6.3.4#6332)
