You are viewing a plain text version of this content. The canonical link for it is here.

Posted to java-user@lucene.apache.org by Mike Polzin <mi...@yahoo.com> on 2010/02/03 02:26:37 UTC

Limiting search result for web search engine

I am working on building a web search engine and I would like to build a reults page similar to what Google does. The functionality I am looking to include is what I refer to a "rolling up" sites, meaning that even if a particular site (defined by its base URL) has many relevent hits on various pages for the searches keywords, that site is only shown once in the results listing with a link to the most relevent hit on that site. What I do not want is to have one site dominate a search results page. 

Does it make sense to just do the search, get the hits list and then programatically remove the results which, although they meet the search criteria, are not as relevent? Is there a way to do this through queries?

Thanks in advance!

Mike

Re: Limiting search result for web search engine

Posted by mpolzin <mi...@yahoo.com>.

Hi thanks for the suggestion. I am relatively new to Lucene, so I have a few
more questions on this implementation. I looked at the source code for
Lucene and found the TopDocCollector class. It appears this class derives
from the HitCollector class, so I should be able to simply extend
TopDocCollector and override the Collect method to simply check to see if I
have a document with the base URL already in collected and inserted. Here is
my psuedo code changes to the Collect method:

            if (score > 0.0f)
            {

                // Do something here to get the document base URL
(doc.BaseURL)

                if ((hq.Size() < numHits || score >= minScore)  &&
collectedBaseURLArray.Contains(doc.BaseURL))
                {
                    collectedBaseURLArray.Add(doc.BaseURL);
                    totalHits++;
                    hq.Insert(new ScoreDoc(doc, score));
                    minScore = ((ScoreDoc) hq.Top()).score; // maintain
minScore
                }
            }

Does this make sense?

How could I tell the search to use my extended version of the
TopDocCollector class? Also, how would I pull the URL from the document
inside of the loop above? I didn't see any good documentation anywhere on
how to do that. There seems to be little information out there on how to
build your own custom collector.

Thanks again,
Mike



Hayri-2 wrote:
> 
> Mike Polzin wrote:
>> I am working on building a web search engine and I would like to build a
>> reults page similar to what Google does. The functionality I am looking
>> to include is what I refer to a "rolling up" sites, meaning that even if
>> a particular site (defined by its base URL) has many relevent hits on
>> various pages for the searches keywords, that site is only shown once in
>> the results listing with a link to the most relevent hit on that site.
>> What I do not want is to have one site dominate a search results page. 
>>
>> Does it make sense to just do the search, get the hits list and then
>> programatically remove the results which, although they meet the search
>> criteria, are not as relevent? Is there a way to do this through queries?
>>
>> Thanks in advance!
>>
>> Mike
>>
>>
>>       
>>   
> Actually why don't you use another index for the search results. For 
> example you make a search and get all the results, then index those 
> documents to a RAMIndex with the matching term. Then retrieve all 
> documents in the RamIndex based on hits for the search terms once.  In 
> other terms you will filter the documents based on maximum term 
> similarity. Is it make sense for you?
> 
> 
>  
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
> 
> 

-- 
View this message in context: http://old.nabble.com/Limiting-search-result-for-web-search-engine-tp27430155p27447266.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Limiting search result for web search engine

Posted by Hayri <vo...@gmail.com>.

Mike Polzin wrote:
> I am working on building a web search engine and I would like to build a reults page similar to what Google does. The functionality I am looking to include is what I refer to a "rolling up" sites, meaning that even if a particular site (defined by its base URL) has many relevent hits on various pages for the searches keywords, that site is only shown once in the results listing with a link to the most relevent hit on that site. What I do not want is to have one site dominate a search results page. 
>
> Does it make sense to just do the search, get the hits list and then programatically remove the results which, although they meet the search criteria, are not as relevent? Is there a way to do this through queries?
>
> Thanks in advance!
>
> Mike
>
>
>       
>   
Actually why don't you use another index for the search results. For 
example you make a search and get all the results, then index those 
documents to a RAMIndex with the matching term. Then retrieve all 
documents in the RamIndex based on hits for the search terms once.  In 
other terms you will filter the documents based on maximum term 
similarity. Is it make sense for you?


 

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Limiting search result for web search engine

Posted by Ian Lea <ia...@gmail.com>.

Mike


Documents do not get passed to Collectors in order of highest score.
It is the job of the collector to gather the top scoring docs, as is
typically required, and implemented by TopScoreDocCollector for the
most commonly used search method calls (according to the javadocs -
read the javadocs!). Other collectors can do whatever they want with
matching documents.

On the performance issue of reading document fields for each hit,
there is a specific warning against it in the javadocs for Collector.


--
Ian.


On Thu, Feb 4, 2010 at 5:23 PM, mpolzin <mi...@yahoo.com> wrote:
>
> Ian,
>
> Yes, this makes sense, my guess is that by creating a custom collector and
> in my overridden Collect method looking up each document by the docid to get
> the base URL is going to create a fairly significant performance hit.  And
> from the sounds of your response there is no guarantee that the documents
> are being passed to the collector in order of highest score? Is that true?
>
> If they were passed in order of highest score I could implement some code to
> only perform that logic for the first x documents depending on what is being
> viewed (1 - 10; 11 - 20; etc). In most cases people never search past the
> first few pages anyhow.
>
> Of course this sort of logic could, as you say, easily be done once post
> search, which is also the simplest approach. But then again, sometimes less
> is more.
>
> Mike
>
>
> Ian Lea wrote:
>>
>> Writing a custom collector is pretty straightforward.  There is an
>> example in the javadocs for Collector.  Use it via
>> Searcher.search(query, collector) or search(query, filter, collector).
>>
>> The docid is passed to the collect() method and you can use that to
>> get at the document and thus the URL, via your searcher or index
>> reader.  But there are performance implications to doing it this way -
>> you'll be looking at the URL for all hits, not just the top n that I
>> imagine you will be displaying.  If the index is big and you'll be
>> getting lots of hits that is likely to be a problem.  FieldCache might
>> help.
>>
>> I think that I'd move your deduping logic to after the search and set
>> a limit on the number of hits that you check.  That way you'd also get
>> the best hit first.
>>
>>
>> --
>> Ian.
>>
>>
>> On Thu, Feb 4, 2010 at 5:23 AM, mpolzin <mi...@yahoo.com> wrote:
>>>
>>> I changed one line below... realized I missed the ! (NOT).. corrected in
>>> original reply.
>>>
>>>
>>>  if ((hq.Size() < numHits || score >= minScore)  &&
>>> !collectedBaseURLArray.Contains(doc.BaseURL))
>>>                {
>>>
>>> mpolzin wrote:
>>>>
>>>>
>>>>             if (score > 0.0f)
>>>>             {
>>>>
>>>>                 // Do something here to get the document base URL
>>>> (doc.BaseURL)
>>>>
>>>>                 if ((hq.Size() < numHits || score >= minScore)  &&
>>>> !collectedBaseURLArray.Contains(doc.BaseURL))
>>>>                 {
>>>>                     collectedBaseURLArray.Add(doc.BaseURL);
>>>>                     totalHits++;
>>>>                     hq.Insert(new ScoreDoc(doc, score));
>>>>                     minScore = ((ScoreDoc) hq.Top()).score; // maintain
>>>> minScore
>>>>                 }
>>>>             }
>>>>
>>>> Does this make sense?
>>>>
>>>> How could I tell the search to use my extended version of the
>>>> TopDocCollector class? Also, how would I pull the URL from the document
>>>> inside of the loop above? I didn't see any good documentation anywhere
>>>> on
>>>> how to do that. There seems to be little information out there on how to
>>>> build your own custom collector.
>>>>
>>>> Thanks again,
>>>> Mike
>>>>
>>>>
>>>> Anshum-2 wrote:
>>>>>
>>>>> Hi Mike,
>>>>> Not really through queries, but you may do this by writing a custom
>>>>> collector. You'd need some supporting data structure to mark/hash the
>>>>> occurrence of a domain in your result set.
>>>>>
>>>>> --
>>>>> Anshum Gupta
>>>>> Naukri Labs!
>>>>> http://ai-cafe.blogspot.com
>>>>>
>>>>> The facts expressed here belong to everybody, the opinions to me. The
>>>>> distinction is yours to draw............
>>>>>
>>>>>
>>>>> On Wed, Feb 3, 2010 at 6:56 AM, Mike Polzin <mi...@yahoo.com>
>>>>> wrote:
>>>>>
>>>>>> I am working on building a web search engine and I would like to build
>>>>>> a
>>>>>> reults page similar to what Google does. The functionality I am
>>>>>> looking
>>>>>> to
>>>>>> include is what I refer to a "rolling up" sites, meaning that even if
>>>>>> a
>>>>>> particular site (defined by its base URL) has many relevent hits on
>>>>>> various
>>>>>> pages for the searches keywords, that site is only shown once in the
>>>>>> results
>>>>>> listing with a link to the most relevent hit on that site. What I do
>>>>>> not
>>>>>> want is to have one site dominate a search results page.
>>>>>>
>>>>>> Does it make sense to just do the search, get the hits list and then
>>>>>> programatically remove the results which, although they meet the
>>>>>> search
>>>>>> criteria, are not as relevent? Is there a way to do this through
>>>>>> queries?
>>>>>>
>>>>>> Thanks in advance!
>>>>>>
>>>>>> Mike
>>>>>>
>>>>>>
>>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>> --
>>> View this message in context:
>>> http://old.nabble.com/Limiting-search-result-for-web-search-engine-tp27430155p27447903.html
>>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>>
>>>
>>> ---------------------------------------------------------------------
>>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>>
>>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
>>
>
> --
> View this message in context: http://old.nabble.com/Limiting-search-result-for-web-search-engine-tp27430155p27456444.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Limiting search result for web search engine

Posted by mpolzin <mi...@yahoo.com>.

Ian,

Yes, this makes sense, my guess is that by creating a custom collector and
in my overridden Collect method looking up each document by the docid to get
the base URL is going to create a fairly significant performance hit.  And
from the sounds of your response there is no guarantee that the documents
are being passed to the collector in order of highest score? Is that true?

If they were passed in order of highest score I could implement some code to
only perform that logic for the first x documents depending on what is being
viewed (1 - 10; 11 - 20; etc). In most cases people never search past the
first few pages anyhow.

Of course this sort of logic could, as you say, easily be done once post
search, which is also the simplest approach. But then again, sometimes less
is more.

Mike


Ian Lea wrote:
> 
> Writing a custom collector is pretty straightforward.  There is an
> example in the javadocs for Collector.  Use it via
> Searcher.search(query, collector) or search(query, filter, collector).
> 
> The docid is passed to the collect() method and you can use that to
> get at the document and thus the URL, via your searcher or index
> reader.  But there are performance implications to doing it this way -
> you'll be looking at the URL for all hits, not just the top n that I
> imagine you will be displaying.  If the index is big and you'll be
> getting lots of hits that is likely to be a problem.  FieldCache might
> help.
> 
> I think that I'd move your deduping logic to after the search and set
> a limit on the number of hits that you check.  That way you'd also get
> the best hit first.
> 
> 
> --
> Ian.
> 
> 
> On Thu, Feb 4, 2010 at 5:23 AM, mpolzin <mi...@yahoo.com> wrote:
>>
>> I changed one line below... realized I missed the ! (NOT).. corrected in
>> original reply.
>>
>>
>>  if ((hq.Size() < numHits || score >= minScore)  &&
>> !collectedBaseURLArray.Contains(doc.BaseURL))
>>                {
>>
>> mpolzin wrote:
>>>
>>>
>>>             if (score > 0.0f)
>>>             {
>>>
>>>                 // Do something here to get the document base URL
>>> (doc.BaseURL)
>>>
>>>                 if ((hq.Size() < numHits || score >= minScore)  &&
>>> !collectedBaseURLArray.Contains(doc.BaseURL))
>>>                 {
>>>                     collectedBaseURLArray.Add(doc.BaseURL);
>>>                     totalHits++;
>>>                     hq.Insert(new ScoreDoc(doc, score));
>>>                     minScore = ((ScoreDoc) hq.Top()).score; // maintain
>>> minScore
>>>                 }
>>>             }
>>>
>>> Does this make sense?
>>>
>>> How could I tell the search to use my extended version of the
>>> TopDocCollector class? Also, how would I pull the URL from the document
>>> inside of the loop above? I didn't see any good documentation anywhere
>>> on
>>> how to do that. There seems to be little information out there on how to
>>> build your own custom collector.
>>>
>>> Thanks again,
>>> Mike
>>>
>>>
>>> Anshum-2 wrote:
>>>>
>>>> Hi Mike,
>>>> Not really through queries, but you may do this by writing a custom
>>>> collector. You'd need some supporting data structure to mark/hash the
>>>> occurrence of a domain in your result set.
>>>>
>>>> --
>>>> Anshum Gupta
>>>> Naukri Labs!
>>>> http://ai-cafe.blogspot.com
>>>>
>>>> The facts expressed here belong to everybody, the opinions to me. The
>>>> distinction is yours to draw............
>>>>
>>>>
>>>> On Wed, Feb 3, 2010 at 6:56 AM, Mike Polzin <mi...@yahoo.com>
>>>> wrote:
>>>>
>>>>> I am working on building a web search engine and I would like to build
>>>>> a
>>>>> reults page similar to what Google does. The functionality I am
>>>>> looking
>>>>> to
>>>>> include is what I refer to a "rolling up" sites, meaning that even if
>>>>> a
>>>>> particular site (defined by its base URL) has many relevent hits on
>>>>> various
>>>>> pages for the searches keywords, that site is only shown once in the
>>>>> results
>>>>> listing with a link to the most relevent hit on that site. What I do
>>>>> not
>>>>> want is to have one site dominate a search results page.
>>>>>
>>>>> Does it make sense to just do the search, get the hits list and then
>>>>> programatically remove the results which, although they meet the
>>>>> search
>>>>> criteria, are not as relevent? Is there a way to do this through
>>>>> queries?
>>>>>
>>>>> Thanks in advance!
>>>>>
>>>>> Mike
>>>>>
>>>>>
>>>>>
>>>>
>>>>
>>>
>>>
>>
>> --
>> View this message in context:
>> http://old.nabble.com/Limiting-search-result-for-web-search-engine-tp27430155p27447903.html
>> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>>
>>
>> ---------------------------------------------------------------------
>> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
>> For additional commands, e-mail: java-user-help@lucene.apache.org
>>
>>
> 
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
> 
> 
> 

-- 
View this message in context: http://old.nabble.com/Limiting-search-result-for-web-search-engine-tp27430155p27456444.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Limiting search result for web search engine

Posted by Ian Lea <ia...@gmail.com>.

Writing a custom collector is pretty straightforward.  There is an
example in the javadocs for Collector.  Use it via
Searcher.search(query, collector) or search(query, filter, collector).

The docid is passed to the collect() method and you can use that to
get at the document and thus the URL, via your searcher or index
reader.  But there are performance implications to doing it this way -
you'll be looking at the URL for all hits, not just the top n that I
imagine you will be displaying.  If the index is big and you'll be
getting lots of hits that is likely to be a problem.  FieldCache might
help.

I think that I'd move your deduping logic to after the search and set
a limit on the number of hits that you check.  That way you'd also get
the best hit first.


--
Ian.


On Thu, Feb 4, 2010 at 5:23 AM, mpolzin <mi...@yahoo.com> wrote:
>
> I changed one line below... realized I missed the ! (NOT).. corrected in
> original reply.
>
>
>  if ((hq.Size() < numHits || score >= minScore)  &&
> !collectedBaseURLArray.Contains(doc.BaseURL))
>                {
>
> mpolzin wrote:
>>
>>
>>             if (score > 0.0f)
>>             {
>>
>>                 // Do something here to get the document base URL
>> (doc.BaseURL)
>>
>>                 if ((hq.Size() < numHits || score >= minScore)  &&
>> !collectedBaseURLArray.Contains(doc.BaseURL))
>>                 {
>>                     collectedBaseURLArray.Add(doc.BaseURL);
>>                     totalHits++;
>>                     hq.Insert(new ScoreDoc(doc, score));
>>                     minScore = ((ScoreDoc) hq.Top()).score; // maintain
>> minScore
>>                 }
>>             }
>>
>> Does this make sense?
>>
>> How could I tell the search to use my extended version of the
>> TopDocCollector class? Also, how would I pull the URL from the document
>> inside of the loop above? I didn't see any good documentation anywhere on
>> how to do that. There seems to be little information out there on how to
>> build your own custom collector.
>>
>> Thanks again,
>> Mike
>>
>>
>> Anshum-2 wrote:
>>>
>>> Hi Mike,
>>> Not really through queries, but you may do this by writing a custom
>>> collector. You'd need some supporting data structure to mark/hash the
>>> occurrence of a domain in your result set.
>>>
>>> --
>>> Anshum Gupta
>>> Naukri Labs!
>>> http://ai-cafe.blogspot.com
>>>
>>> The facts expressed here belong to everybody, the opinions to me. The
>>> distinction is yours to draw............
>>>
>>>
>>> On Wed, Feb 3, 2010 at 6:56 AM, Mike Polzin <mi...@yahoo.com> wrote:
>>>
>>>> I am working on building a web search engine and I would like to build a
>>>> reults page similar to what Google does. The functionality I am looking
>>>> to
>>>> include is what I refer to a "rolling up" sites, meaning that even if a
>>>> particular site (defined by its base URL) has many relevent hits on
>>>> various
>>>> pages for the searches keywords, that site is only shown once in the
>>>> results
>>>> listing with a link to the most relevent hit on that site. What I do not
>>>> want is to have one site dominate a search results page.
>>>>
>>>> Does it make sense to just do the search, get the hits list and then
>>>> programatically remove the results which, although they meet the search
>>>> criteria, are not as relevent? Is there a way to do this through
>>>> queries?
>>>>
>>>> Thanks in advance!
>>>>
>>>> Mike
>>>>
>>>>
>>>>
>>>
>>>
>>
>>
>
> --
> View this message in context: http://old.nabble.com/Limiting-search-result-for-web-search-engine-tp27430155p27447903.html
> Sent from the Lucene - Java Users mailing list archive at Nabble.com.
>
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Limiting search result for web search engine

Posted by mpolzin <mi...@yahoo.com>.

I changed one line below... realized I missed the ! (NOT).. corrected in
original reply.


 if ((hq.Size() < numHits || score >= minScore)  &&
!collectedBaseURLArray.Contains(doc.BaseURL)) 
                { 

mpolzin wrote:
> 
> 
>             if (score > 0.0f) 
>             { 
> 
>                 // Do something here to get the document base URL
> (doc.BaseURL) 
> 
>                 if ((hq.Size() < numHits || score >= minScore)  &&
> !collectedBaseURLArray.Contains(doc.BaseURL)) 
>                 { 
>                     collectedBaseURLArray.Add(doc.BaseURL); 
>                     totalHits++; 
>                     hq.Insert(new ScoreDoc(doc, score)); 
>                     minScore = ((ScoreDoc) hq.Top()).score; // maintain
> minScore 
>                 } 
>             } 
> 
> Does this make sense? 
> 
> How could I tell the search to use my extended version of the
> TopDocCollector class? Also, how would I pull the URL from the document
> inside of the loop above? I didn't see any good documentation anywhere on
> how to do that. There seems to be little information out there on how to
> build your own custom collector. 
> 
> Thanks again, 
> Mike 
> 
> 
> Anshum-2 wrote:
>> 
>> Hi Mike,
>> Not really through queries, but you may do this by writing a custom
>> collector. You'd need some supporting data structure to mark/hash the
>> occurrence of a domain in your result set.
>> 
>> --
>> Anshum Gupta
>> Naukri Labs!
>> http://ai-cafe.blogspot.com
>> 
>> The facts expressed here belong to everybody, the opinions to me. The
>> distinction is yours to draw............
>> 
>> 
>> On Wed, Feb 3, 2010 at 6:56 AM, Mike Polzin <mi...@yahoo.com> wrote:
>> 
>>> I am working on building a web search engine and I would like to build a
>>> reults page similar to what Google does. The functionality I am looking
>>> to
>>> include is what I refer to a "rolling up" sites, meaning that even if a
>>> particular site (defined by its base URL) has many relevent hits on
>>> various
>>> pages for the searches keywords, that site is only shown once in the
>>> results
>>> listing with a link to the most relevent hit on that site. What I do not
>>> want is to have one site dominate a search results page.
>>>
>>> Does it make sense to just do the search, get the hits list and then
>>> programatically remove the results which, although they meet the search
>>> criteria, are not as relevent? Is there a way to do this through
>>> queries?
>>>
>>> Thanks in advance!
>>>
>>> Mike
>>>
>>>
>>>
>> 
>> 
> 
> 

-- 
View this message in context: http://old.nabble.com/Limiting-search-result-for-web-search-engine-tp27430155p27447903.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Limiting search result for web search engine

Posted by mpolzin <mi...@yahoo.com>.

Hi thanks for the suggestion. I am relatively new to Lucene, so I have a few
more questions on this implementation. I looked at the source code for
Lucene and found the TopDocCollector class. It appears this class derives
from the HitCollector class, so I should be able to simply extend
TopDocCollector and override the Collect method to simply check to see if I
have a document with the base URL already in collected and inserted. Here is
my psuedo code changes to the Collect method: 

            if (score > 0.0f) 
            { 

                // Do something here to get the document base URL
(doc.BaseURL) 

                if ((hq.Size() < numHits || score >= minScore)  &&
collectedBaseURLArray.Contains(doc.BaseURL)) 
                { 
                    collectedBaseURLArray.Add(doc.BaseURL); 
                    totalHits++; 
                    hq.Insert(new ScoreDoc(doc, score)); 
                    minScore = ((ScoreDoc) hq.Top()).score; // maintain
minScore 
                } 
            } 

Does this make sense? 

How could I tell the search to use my extended version of the
TopDocCollector class? Also, how would I pull the URL from the document
inside of the loop above? I didn't see any good documentation anywhere on
how to do that. There seems to be little information out there on how to
build your own custom collector. 

Thanks again, 
Mike 


Anshum-2 wrote:
> 
> Hi Mike,
> Not really through queries, but you may do this by writing a custom
> collector. You'd need some supporting data structure to mark/hash the
> occurrence of a domain in your result set.
> 
> --
> Anshum Gupta
> Naukri Labs!
> http://ai-cafe.blogspot.com
> 
> The facts expressed here belong to everybody, the opinions to me. The
> distinction is yours to draw............
> 
> 
> On Wed, Feb 3, 2010 at 6:56 AM, Mike Polzin <mi...@yahoo.com> wrote:
> 
>> I am working on building a web search engine and I would like to build a
>> reults page similar to what Google does. The functionality I am looking
>> to
>> include is what I refer to a "rolling up" sites, meaning that even if a
>> particular site (defined by its base URL) has many relevent hits on
>> various
>> pages for the searches keywords, that site is only shown once in the
>> results
>> listing with a link to the most relevent hit on that site. What I do not
>> want is to have one site dominate a search results page.
>>
>> Does it make sense to just do the search, get the hits list and then
>> programatically remove the results which, although they meet the search
>> criteria, are not as relevent? Is there a way to do this through queries?
>>
>> Thanks in advance!
>>
>> Mike
>>
>>
>>
> 
> 

-- 
View this message in context: http://old.nabble.com/Limiting-search-result-for-web-search-engine-tp27430155p27447319.html
Sent from the Lucene - Java Users mailing list archive at Nabble.com.


---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: Limiting search result for web search engine

Posted by Anshum <an...@gmail.com>.

Hi Mike,
Not really through queries, but you may do this by writing a custom
collector. You'd need some supporting data structure to mark/hash the
occurrence of a domain in your result set.

--
Anshum Gupta
Naukri Labs!
http://ai-cafe.blogspot.com

The facts expressed here belong to everybody, the opinions to me. The
distinction is yours to draw............

On Wed, Feb 3, 2010 at 6:56 AM, Mike Polzin <mi...@yahoo.com> wrote:

> I am working on building a web search engine and I would like to build a
> reults page similar to what Google does. The functionality I am looking to
> include is what I refer to a "rolling up" sites, meaning that even if a
> particular site (defined by its base URL) has many relevent hits on various
> pages for the searches keywords, that site is only shown once in the results
> listing with a link to the most relevent hit on that site. What I do not
> want is to have one site dominate a search results page.
>
> Does it make sense to just do the search, get the hits list and then
> programatically remove the results which, although they meet the search
> criteria, are not as relevent? Is there a way to do this through queries?
>
> Thanks in advance!
>
> Mike
>
>
>