You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by Michael Masters <mm...@gmail.com> on 2009/10/01 18:56:21 UTC

document diversity

I was wondering if there is any way to control what kind of documents
are returned from a search. For example, lets say we have an index
built from different types of documents (pdf, txt, html, etc.). Is
there a way to have the first x results have a specified distribution
of document types? It would be nice to have an even number of results
that are from pdfs, txt files, and html files.


Any help would greatly be appreciated.


-Mike

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: document diversity

Posted by Phil Whelan <ph...@gmail.com>.

Hi Mike,

I'd simply store a field "doctype" with values "pdf", "txt", "html"
and perform a separate search for each type. Although, I'd be
interested if anyone has a cooler way of doing this.

Cheers,
Phil

On Thu, Oct 1, 2009 at 9:56 AM, Michael Masters <mm...@gmail.com> wrote:
> I was wondering if there is any way to control what kind of documents
> are returned from a search. For example, lets say we have an index
> built from different types of documents (pdf, txt, html, etc.). Is
> there a way to have the first x results have a specified distribution
> of document types? It would be nice to have an even number of results
> that are from pdfs, txt files, and html files.
>
>
> Any help would greatly be appreciated.
>
>
> -Mike
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>



-- 
Cell : +1  778-233-4935
Skype : philwhelan76
Twitter : http://www.twitter.com/philwhln

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: document diversity

Posted by Tricia Williams <wi...@gmail.com>.

Hi Mike,

The first thing that comes to mind is to run a query for each document 
type (assuming that you have a field that stores the type) and qualify 
the document type: for example type:pdf.  Then you would have to write 
something to combine the query results drawing an equal number of hits 
from each query result.  I think that the score would still be able to 
provide you with an appropriate relevance score if you wanted to 
maintain relevance order across all the hits from all document types, 
but my logic could be faulty there.

Tricia

Michael Masters wrote:
> I was wondering if there is any way to control what kind of documents
> are returned from a search. For example, lets say we have an index
> built from different types of documents (pdf, txt, html, etc.). Is
> there a way to have the first x results have a specified distribution
> of document types? It would be nice to have an even number of results
> that are from pdfs, txt files, and html files.
>
>
> Any help would greatly be appreciated.
>
>
> -Mike
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>
>   

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: document diversity

Posted by Simon Willnauer <si...@googlemail.com>.

Michael,
this sounds like a pretty good usecase for CustomScoreQuery
(http://lucene.apache.org/java/2_9_0/api/core/org/apache/lucene/search/function/CustomScoreQuery.html)
The org.apache.lucene.search.function package provides flexible
programmatic control over document scores. You boost up documents with
greatest revenue or throw in some factors based on potential revenue
even if the revenue "value" for a document has changed since it was
indexed.

Another way would be to only update the norm value of a document
periodically for documents with changing revenue. This prevents you
from re-indexing. See IndexReader#setNorm (
http://lucene.apache.org/java/2_9_0/api/core/org/apache/lucene/index/IndexReader.html#setNorm(int,%20java.lang.String,%20byte)
)

simon


On Tue, Oct 6, 2009 at 10:33 PM, Michael Masters <mm...@gmail.com> wrote:
> My initial description may have been a little abstract. Maybe I should
> explain exactly what I'm trying to do. My company has various revenue
> channels, one of which is per click. If a user does a search, we would
> like to show results with the greatest revenue, although we don't want
> people to be able to buy all the top results. Hence, we would like to
> have some way of mixing results. The mixing of results could be based
> of potential revenue, relevancy, which revenue stream the result is
> associated with, etc.
>
> The previously mentioned ideas are great btw.
>
> -Mike
>
>
> On Sat, Oct 3, 2009 at 4:25 PM, Grant Ingersoll <gs...@apache.org> wrote:
>> I'm curious, can you elaborate more on the deeper use case for this?
>>
>> Perhaps just implementing faceting on doc type would be sufficient?  That
>> way users can drill in on doc type.  Alternatively, I suppose you could
>> implement a hit collector that accesses a field cache on the doc type field
>> and promotes lesser seen doc types until they are evenly represented.  Could
>> also likely write a Function query that does a similar thing.  I'd imagine
>> you need to be careful to control your memory.
>>
>> -Grant
>>
>> On Oct 1, 2009, at 12:56 PM, Michael Masters wrote:
>>
>>> I was wondering if there is any way to control what kind of documents
>>> are returned from a search. For example, lets say we have an index
>>> built from different types of documents (pdf, txt, html, etc.). Is
>>> there a way to have the first x results have a specified distribution
>>> of document types? It would be nice to have an even number of results
>>> that are from pdfs, txt files, and html files.
>>>
>>>
>>> Any help would greatly be appreciated.
>>>
>>>
>>> -Mike
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>
>> --------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com/
>>
>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
>> Solr/Lucene:
>> http://www.lucidimagination.com/search
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: document diversity

Posted by Paul Libbrecht <pa...@activemath.org>.

Just as you can add a query that will boost better things with a  
higher quality, you can add a query for a higher revenue.

Basically, the default operator "should" in boolean-clauses can be  
used exactly for that: do not force this query to be matched but raise  
boost if there's something that matches.

The translation of the user-query, itself, is marked as "must" (and  
inside this one is orred in all the various flavours, e.g. per  
language, phonetic...).

paul


Le 06-oct.-09 à 22:33, Michael Masters a écrit :

> My initial description may have been a little abstract. Maybe I should
> explain exactly what I'm trying to do. My company has various revenue
> channels, one of which is per click. If a user does a search, we would
> like to show results with the greatest revenue, although we don't want
> people to be able to buy all the top results. Hence, we would like to
> have some way of mixing results. The mixing of results could be based
> of potential revenue, relevancy, which revenue stream the result is
> associated with, etc.
>
> The previously mentioned ideas are great btw.
>
> -Mike
>
>
> On Sat, Oct 3, 2009 at 4:25 PM, Grant Ingersoll  
> <gs...@apache.org> wrote:
>> I'm curious, can you elaborate more on the deeper use case for this?
>>
>> Perhaps just implementing faceting on doc type would be  
>> sufficient?  That
>> way users can drill in on doc type.  Alternatively, I suppose you  
>> could
>> implement a hit collector that accesses a field cache on the doc  
>> type field
>> and promotes lesser seen doc types until they are evenly  
>> represented.  Could
>> also likely write a Function query that does a similar thing.  I'd  
>> imagine
>> you need to be careful to control your memory.
>>
>> -Grant
>>
>> On Oct 1, 2009, at 12:56 PM, Michael Masters wrote:
>>
>>> I was wondering if there is any way to control what kind of  
>>> documents
>>> are returned from a search. For example, lets say we have an index
>>> built from different types of documents (pdf, txt, html, etc.). Is
>>> there a way to have the first x results have a specified  
>>> distribution
>>> of document types? It would be nice to have an even number of  
>>> results
>>> that are from pdfs, txt files, and html files.
>>>
>>>
>>> Any help would greatly be appreciated.
>>>
>>>
>>> -Mike
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>
>> --------------------------
>> Grant Ingersoll
>> http://www.lucidimagination.com/
>>
>> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
>> using
>> Solr/Lucene:
>> http://www.lucidimagination.com/search
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>

Re: document diversity

Posted by Michael Masters <mm...@gmail.com>.

My initial description may have been a little abstract. Maybe I should
explain exactly what I'm trying to do. My company has various revenue
channels, one of which is per click. If a user does a search, we would
like to show results with the greatest revenue, although we don't want
people to be able to buy all the top results. Hence, we would like to
have some way of mixing results. The mixing of results could be based
of potential revenue, relevancy, which revenue stream the result is
associated with, etc.

The previously mentioned ideas are great btw.

-Mike


On Sat, Oct 3, 2009 at 4:25 PM, Grant Ingersoll <gs...@apache.org> wrote:
> I'm curious, can you elaborate more on the deeper use case for this?
>
> Perhaps just implementing faceting on doc type would be sufficient?  That
> way users can drill in on doc type.  Alternatively, I suppose you could
> implement a hit collector that accesses a field cache on the doc type field
> and promotes lesser seen doc types until they are evenly represented.  Could
> also likely write a Function query that does a similar thing.  I'd imagine
> you need to be careful to control your memory.
>
> -Grant
>
> On Oct 1, 2009, at 12:56 PM, Michael Masters wrote:
>
>> I was wondering if there is any way to control what kind of documents
>> are returned from a search. For example, lets say we have an index
>> built from different types of documents (pdf, txt, html, etc.). Is
>> there a way to have the first x results have a specified distribution
>> of document types? It would be nice to have an even number of results
>> that are from pdfs, txt files, and html files.
>>
>>
>> Any help would greatly be appreciated.
>>
>>
>> -Mike
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>
> --------------------------
> Grant Ingersoll
> http://www.lucidimagination.com/
>
> Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids) using
> Solr/Lucene:
> http://www.lucidimagination.com/search
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: document diversity

Posted by Grant Ingersoll <gs...@apache.org>.

I'm curious, can you elaborate more on the deeper use case for this?

Perhaps just implementing faceting on doc type would be sufficient?   
That way users can drill in on doc type.  Alternatively, I suppose you  
could implement a hit collector that accesses a field cache on the doc  
type field and promotes lesser seen doc types until they are evenly  
represented.  Could also likely write a Function query that does a  
similar thing.  I'd imagine you need to be careful to control your  
memory.

-Grant

On Oct 1, 2009, at 12:56 PM, Michael Masters wrote:

> I was wondering if there is any way to control what kind of documents
> are returned from a search. For example, lets say we have an index
> built from different types of documents (pdf, txt, html, etc.). Is
> there a way to have the first x results have a specified distribution
> of document types? It would be nice to have an even number of results
> that are from pdfs, txt files, and html files.
>
>
> Any help would greatly be appreciated.
>
>
> -Mike
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>

--------------------------
Grant Ingersoll
http://www.lucidimagination.com/

Search the Lucene ecosystem (Lucene/Solr/Nutch/Mahout/Tika/Droids)  
using Solr/Lucene:
http://www.lucidimagination.com/search

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org