Posted to java-user@lucene.apache.org by sk...@sloan.mit.edu on 2010/07/23 08:30:10 UTC

Reverse Lucene queries

Hi all, I have an interesting problem... instead of going from a query
to a document collection, is it possible to come up with the best-fit
query for a given document collection (the results)? By "best fit" I
mean a query that maximizes the hit scores of the resulting document
collection.

How should I approach this? All suggestions appreciated.

Thanks
Shashi



Re: Reverse Lucene queries

Posted by Grant Ingersoll <gs...@apache.org>.
On Jul 23, 2010, at 5:06 AM, Karl Wettin wrote:

> 
> On 23 Jul 2010, at 08:30, skant@sloan.mit.edu wrote:
> 
>> Hi all, I have an interesting problem... instead of going from a query
>> to a document collection, is it possible to come up with the best-fit
>> query for a given document collection (the results)? By "best fit" I
>> mean a query that maximizes the hit scores of the resulting document
>> collection.
> 
> It would probably be helpful if you explained what you are attempting to achieve by doing this. Are you looking for MoreLikeThis?

MatchAllDocsQuery returns every document in the collection, each with a score of 1, so it trivially "maximizes" the hit scores.  Somehow, I don't think that is what you are after.  Perhaps you mean: given all the queries you've seen in the past, find the "best" one?
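
To see why that is a degenerate answer, here is a minimal sketch
(Lucene 3.x style APIs; the index path is just a placeholder):

import java.io.File;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.MatchAllDocsQuery;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class MatchAllDemo {
  public static void main(String[] args) throws Exception {
    // "/path/to/index" is a placeholder for wherever your index lives
    Directory dir = FSDirectory.open(new File("/path/to/index"));
    IndexReader reader = IndexReader.open(dir);         // read-only reader
    IndexSearcher searcher = new IndexSearcher(reader);

    TopDocs hits = searcher.search(new MatchAllDocsQuery(), 10);
    for (ScoreDoc sd : hits.scoreDocs) {
      // every hit comes back with the same (maximal) score
      System.out.println("doc=" + sd.doc + " score=" + sd.score);
    }

    searcher.close();
    reader.close();
  }
}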





Re: Reverse Lucene queries

Posted by Karl Wettin <ka...@gmail.com>.
On 23 Jul 2010, at 08:30, skant@sloan.mit.edu wrote:

> Hi all, I have an interesting problem... instead of going from a query
> to a document collection, is it possible to come up with the best-fit
> query for a given document collection (the results)? By "best fit" I
> mean a query that maximizes the hit scores of the resulting document
> collection.

It would probably be helpful if you explained what you are attempting
to achieve by doing this. Are you looking for MoreLikeThis?
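
If MoreLikeThis is what you want, a minimal sketch with the contrib
class org.apache.lucene.search.similar.MoreLikeThis (3.x layout; the
field name "contents", the example doc id and the tuning thresholds
are placeholders):

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.search.similar.MoreLikeThis;

public class MoreLikeThisSketch {
  /** Find the 10 documents most similar to the given example document. */
  static TopDocs similarTo(IndexReader reader, IndexSearcher searcher,
                           int exampleDocId) throws Exception {
    MoreLikeThis mlt = new MoreLikeThis(reader);
    mlt.setFieldNames(new String[] { "contents" });  // placeholder field name
    mlt.setMinTermFreq(1);                           // placeholder thresholds
    mlt.setMinDocFreq(2);
    Query query = mlt.like(exampleDocId);            // query built from the example doc's terms
    return searcher.search(query, 10);
  }
}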

> How should I approach this? All suggestions appreciated.


How expensive an operation is this allowed to be? Can you spend
seconds, minutes, hours or days?
Are there any requirements on precision and recall?

Whatever else you do, I would start by looking at the output of a
feature selection algorithm fed with the complete corpus divided into
the two classes "query factory set" (the documents you want the query
to match) and "all other documents".

The output will not tell you why the terms are important, just that
they are probably useful when deciding whether a document belongs to
the query factory set or to all other documents.

It's hard to say where to go from there.

Create a set of selected terms available in the query factory set.
Create a set of selected terms available in all other documents.
Create a set of selected terms only available in the query factory set.
Create a set of selected terms only available in all other documents.

See if there is a simple strategy based on the above that produces a
good result.
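
For instance, the most naive strategy would be to OR together the
terms that occur only in the query factory set. A sketch, assuming
term vectors are stored for the field; the field name and the two
doc-id arrays are placeholders you would have to supply:

import java.util.HashSet;
import java.util.Set;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.Term;
import org.apache.lucene.index.TermFreqVector;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.TermQuery;

public class SimpleReverseQuery {

  /** Collect the distinct terms of one field over a set of documents (needs term vectors). */
  static Set<String> termsOf(IndexReader reader, int[] docIds, String field) throws Exception {
    Set<String> terms = new HashSet<String>();
    for (int docId : docIds) {
      TermFreqVector tfv = reader.getTermFreqVector(docId, field);
      if (tfv == null) continue;
      for (String t : tfv.getTerms()) {
        terms.add(t);
      }
    }
    return terms;
  }

  /** OR together the terms that occur only in the query factory set. */
  static Query buildQuery(IndexReader reader, int[] factoryDocs, int[] otherDocs,
                          String field) throws Exception {
    Set<String> factoryOnly = termsOf(reader, factoryDocs, field);
    factoryOnly.removeAll(termsOf(reader, otherDocs, field));

    BooleanQuery query = new BooleanQuery();
    for (String t : factoryOnly) {
      query.add(new TermQuery(new Term(field, t)), BooleanClause.Occur.SHOULD);
    }
    return query;
  }
}

Note that BooleanQuery.setMaxClauseCount() may need raising if the
surviving term set is larger than the default 1024 clauses.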

If not, you might want to look into an evolutionary algorithm that
executes queries built from permutations of the selected features in
order to find the best query. Or, if you have the resources, simply
generate every permutation of the queries and evaluate them all.
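
The brute-force version might look something like this. The fitness
function here, summing the scores of the target documents a candidate
query returns, is just one plausible choice, and the candidate term
list has to be kept small since the loop is exponential:

import java.util.List;
import java.util.Set;

import org.apache.lucene.index.Term;
import org.apache.lucene.search.BooleanClause;
import org.apache.lucene.search.BooleanQuery;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TermQuery;

public class BruteForceReverseQuery {

  /** Score a candidate query by summing the hit scores of the target documents it returns. */
  static float fitness(IndexSearcher searcher, Query q, Set<Integer> targetDocs) throws Exception {
    float sum = 0f;
    for (ScoreDoc sd : searcher.search(q, targetDocs.size() + 10).scoreDocs) {
      if (targetDocs.contains(sd.doc)) sum += sd.score;
    }
    return sum;
  }

  static Query bestQuery(IndexSearcher searcher, List<String> candidateTerms,
                         String field, Set<Integer> targetDocs) throws Exception {
    Query best = null;
    float bestScore = Float.NEGATIVE_INFINITY;
    // enumerate every non-empty subset of the candidate terms (keep the list small!)
    for (int mask = 1; mask < (1 << candidateTerms.size()); mask++) {
      BooleanQuery q = new BooleanQuery();
      for (int i = 0; i < candidateTerms.size(); i++) {
        if ((mask & (1 << i)) != 0) {
          q.add(new TermQuery(new Term(field, candidateTerms.get(i))), BooleanClause.Occur.SHOULD);
        }
      }
      float score = fitness(searcher, q, targetDocs);
      if (score > bestScore) {
        bestScore = score;
        best = q;
      }
    }
    return best;
  }
}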

If it works, then I think all of the steps above could be optimized,
cached or simplified in several ways to make it speedy.

See Mahout, Weka (which has a good experimenter/explorer GUI),
RapidMiner, etc. for machine learning APIs.

It should not be too complicated to implement a gain ratio feature
selector directly on top of IndexReader if term vectors are
available.
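
A rough sketch of such a selector, again assuming stored term vectors
and treating term presence as a binary split between the two classes;
the maths is the standard gain ratio definition (information gain
divided by split information), and all names are placeholders:

import java.util.HashMap;
import java.util.HashSet;
import java.util.Map;
import java.util.Set;

import org.apache.lucene.index.IndexReader;
import org.apache.lucene.index.TermFreqVector;

public class GainRatioSelector {

  static double entropy(double... probs) {
    double h = 0.0;
    for (double p : probs) {
      if (p > 0.0) h -= p * (Math.log(p) / Math.log(2));
    }
    return h;
  }

  /** Count, per term, in how many of the given documents it occurs. */
  static Map<String, Integer> docFreq(IndexReader reader, int[] docIds, String field) throws Exception {
    Map<String, Integer> counts = new HashMap<String, Integer>();
    for (int docId : docIds) {
      TermFreqVector tfv = reader.getTermFreqVector(docId, field);
      if (tfv == null) continue;
      for (String t : tfv.getTerms()) {
        Integer c = counts.get(t);
        counts.put(t, c == null ? 1 : c + 1);
      }
    }
    return counts;
  }

  /** Gain ratio of each term for separating factoryDocs from otherDocs. */
  static Map<String, Double> gainRatios(IndexReader reader, int[] factoryDocs, int[] otherDocs,
                                        String field) throws Exception {
    Map<String, Integer> inFactory = docFreq(reader, factoryDocs, field);
    Map<String, Integer> inOther = docFreq(reader, otherDocs, field);

    double nA = factoryDocs.length, nB = otherDocs.length, n = nA + nB;
    double classEntropy = entropy(nA / n, nB / n);

    Set<String> allTerms = new HashSet<String>(inFactory.keySet());
    allTerms.addAll(inOther.keySet());

    Map<String, Double> ratios = new HashMap<String, Double>();
    for (String term : allTerms) {
      double a = inFactory.containsKey(term) ? inFactory.get(term) : 0;  // factory docs with term
      double b = inOther.containsKey(term) ? inOther.get(term) : 0;      // other docs with term
      double with = a + b, without = n - with;

      // conditional class entropy given term presence/absence
      double hWith = with > 0 ? entropy(a / with, b / with) : 0.0;
      double hWithout = without > 0 ? entropy((nA - a) / without, (nB - b) / without) : 0.0;
      double infoGain = classEntropy - (with / n) * hWith - (without / n) * hWithout;

      double splitInfo = entropy(with / n, without / n);
      if (splitInfo > 0.0) {
        ratios.put(term, infoGain / splitInfo);
      }
    }
    return ratios;
  }
}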


	karl

