Posted to user@lucenenet.apache.org by Matt Honeycutt <mb...@gmail.com> on 2010/04/19 23:50:43 UTC

Handling "Duplicate" Search Results

Hi all,

I'm curious how others are approaching this or similar problems. The documents indexed by our system
are flagged so that we know which ones are very similar (in terms of textual content) to one another.
We also know which ones use different text but likely refer to the same "project". In our final
results, we'd like to collapse documents that are "duplicates" (either in terms of content or because
they refer to the same project) into a single result. We don't want to remove them from the index,
because the user might still want to look at them, but we don't want them to show up in the results
like everything else.

Collapsing the "duplicates" on page 1 is easy: if my target is to display 10 results, I just keep
accumulating hits until I have 10 "unique" results. Things get more complicated when the user wants
to see page 2, though. It isn't as simple as setting the starting index to pageSize * pageNum
anymore; now we need to know exactly how many documents we skipped on the first page.

I have a couple of thoughts on how to do this. First, we could re-collapse the results for all pages
prior to the requested page on each request. This would produce the desired output, but it means
re-scanning and re-deduplicating everything before the requested page every time.
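
For illustration, a rough sketch of what that first option might look like (assuming a Lucene.NET
2.9-style API, with a hypothetical stored field "dupKey" marking each document's duplicate group;
exact member casing differs a little between Lucene.NET versions):

// Rough sketch only: re-scan and re-collapse from the first hit on every
// request.  Assumes a Lucene.NET 2.9-style API and a hypothetical stored
// field "dupKey" identifying each document's duplicate group.
using System.Collections.Generic;
using Lucene.Net.Documents;
using Lucene.Net.Search;

public static List<Document> GetCollapsedPage(IndexSearcher searcher,
                                              Query query,
                                              int pageNum,   // zero-based
                                              int pageSize)
{
    // Over-fetch raw hits, since we don't know up front how many it takes
    // to fill pageNum + 1 pages of *unique* results.
    TopDocs hits = searcher.Search(query, (pageNum + 1) * pageSize * 4);

    var seenKeys = new HashSet<string>();
    var page = new List<Document>(pageSize);
    int uniqueSeen = 0;
    int firstWanted = pageNum * pageSize;

    foreach (ScoreDoc hit in hits.ScoreDocs)
    {
        Document doc = searcher.Doc(hit.Doc);
        string key = doc.Get("dupKey");

        // Collapse: skip any document whose group has already been seen.
        if (key != null && !seenKeys.Add(key))
            continue;

        if (uniqueSeen >= firstWanted)
            page.Add(doc);
        uniqueSeen++;

        if (page.Count == pageSize)
            break;
    }
    return page;
}

If collapsing removes a lot of hits, the over-fetch factor has to grow (or the search has to be
re-issued with a larger n), which is exactly the inefficiency that bothers me.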

The (only) alternative I've come up with is to track the starting index for
the next page, and use it to start accumulating results for the next
request.  Since this is a web application and all queries are submitted over
HTTP GET, the next page and starting index parameters would have to be
exposed as query string parameters, which means a user could muck with those
and cause the system to produce inconsistent output.
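
Concretely, the next-page link would carry both the page number and the raw hit index where the next
page starts, something like this (sketch only; the method and parameter names are made up):

// Sketch of the second option: along with each page we remember where the
// *next* page starts in the raw (uncollapsed) hit list, and carry that
// cursor in the next-page link.  Names are made up for illustration.
public static string BuildNextPageLink(string baseUrl, string queryText,
                                       int nextPageNum, int nextRawStart)
{
    return string.Format("{0}?q={1}&page={2}&rawStart={3}",
                         baseUrl,
                         System.Uri.EscapeDataString(queryText),
                         nextPageNum,
                         nextRawStart);
}
// The next request resumes collapsing at hits.ScoreDocs[rawStart] instead of
// re-scanning from zero -- which only holds up if nobody edits rawStart.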

Any thoughts?

RE: Handling "Duplicate" Search Results

Posted by Hugh Spiller <Hu...@Renishaw.com>.
Hi Matt,

I've been working on a similar web search system. After playing with the
alternatives, I ended up taking the brute-force approach of going
through all results (before and after the current page) and eliminating
duplicates, for these reasons:

- We need to be able to skip pages, e.g. start at page 5 without knowing
the start index.
- We need to know the total number of results and pages so we can
display these to the user.
- Most of our searches return under 100 results.
- Even with result sets of a thousand or more, this is still pretty
quick compared to the page load time. Try it.

(And if someone does mess with the query string, they deserve the
results they get. They can't harm you with bad queries, only their own
search experience.)
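
In case it helps, the shape of it is roughly this (just a sketch; "dupKey" is a stand-in for whatever
field marks your duplicate groups, and member casing differs slightly between Lucene.NET versions):

// Brute-force sketch: collapse the *entire* hit list once, then slice the
// requested page out of the collapsed list.  This also gives the total
// number of unique results and pages for free.
using System;
using System.Collections.Generic;
using Lucene.Net.Documents;
using Lucene.Net.Search;

public static List<Document> GetPage(IndexSearcher searcher, Query query,
                                     int pageNum, int pageSize,
                                     out int totalUnique, out int totalPages)
{
    // Two passes: ask for one hit to learn the total, then fetch them all.
    int totalHits = searcher.Search(query, 1).TotalHits;
    TopDocs hits = searcher.Search(query, Math.Max(1, totalHits));

    var seenKeys = new HashSet<string>();
    var collapsed = new List<Document>();
    foreach (ScoreDoc hit in hits.ScoreDocs)
    {
        Document doc = searcher.Doc(hit.Doc);
        string key = doc.Get("dupKey");
        if (key == null || seenKeys.Add(key))   // keep the first doc of each group
            collapsed.Add(doc);
    }

    totalUnique = collapsed.Count;
    totalPages = (totalUnique + pageSize - 1) / pageSize;

    int start = Math.Min(pageNum * pageSize, totalUnique);   // zero-based page
    int count = Math.Min(pageSize, totalUnique - start);
    return collapsed.GetRange(start, count);
}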

________________________________

Hugh Spiller 


Re: Handling "Duplicate" Search Results

Posted by Matt Honeycutt <mb...@gmail.com>.
Thanks for the feedback. I had forgotten about allowing users to jump to a specific page, so it seems
like precomputing the starting document offsets is not going to be feasible for larger result sets
(some of ours run to tens of thousands of documents). Since most users only examine the top page or
two, I can just recompute the necessary pages on each request. The performance will suffer if the
user tries to jump to page 100 or so, but it *should* be OK for the first few pages at least.


Re: Handling "Duplicate" Search Results

Posted by Noel Lysaght <ly...@hotmail.com>.
Hi Matt, I think your last option is the best: actually storing the last index point you have read.
But you have a number of options to make it harder for users to read or tamper with those settings:

1. Encode and encrypt the settings if you store them on the URL.
2. Encode and encrypt the settings and store them as cookies.
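
As a rough sketch of option 1, you could make the cursor tamper-evident by signing it with an HMAC
before it goes on the URL (a secret kept on the server; full encryption, e.g. AES, would follow the
same pattern if you also want to hide the value -- key handling is simplified here for illustration):

// Rough sketch: sign the paging cursor with an HMAC so it is tamper-evident
// on the URL.  Remember to URL-encode the token, since Base64 can contain
// characters like '+' and '='.
using System;
using System.Security.Cryptography;
using System.Text;

public static class PagingToken
{
    private static readonly byte[] Secret =
        Encoding.UTF8.GetBytes("replace-with-a-real-server-side-secret");

    public static string Protect(int pageNum, int rawStart)
    {
        string payload = pageNum + ":" + rawStart;
        return payload + ":" + Sign(payload);   // e.g. "3:47:Bt2p..."
    }

    public static bool TryUnprotect(string token, out int pageNum, out int rawStart)
    {
        pageNum = 0;
        rawStart = 0;
        string[] parts = token.Split(':');
        if (parts.Length != 3)
            return false;

        string payload = parts[0] + ":" + parts[1];
        if (Sign(payload) != parts[2])          // signature mismatch => tampered
            return false;

        return int.TryParse(parts[0], out pageNum)
            && int.TryParse(parts[1], out rawStart);
    }

    private static string Sign(string payload)
    {
        using (var hmac = new HMACSHA256(Secret))
        {
            byte[] hash = hmac.ComputeHash(Encoding.UTF8.GetBytes(payload));
            return Convert.ToBase64String(hash);
        }
    }
}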

Just remember that if you are adding documents to the index at the same time as users are reading
from it, the internal document numbers can change, so you would also need to cache the original
search results.
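
For example, in an ASP.NET app you could cache the collapsed list per query for a few minutes, keyed
on the query text (just a sketch; it assumes you keep your own stable document keys rather than
Lucene's internal doc numbers):

// Sketch: cache the collapsed list of document keys per query for a few
// minutes, so paging works against a stable snapshot even if the index
// (and its internal document numbers) changes underneath.
using System;
using System.Collections.Generic;
using System.Web;
using System.Web.Caching;

public static class ResultCache
{
    public static List<string> GetOrAdd(string queryText,
                                        Func<List<string>> runSearch)
    {
        string cacheKey = "results:" + queryText;
        var cached = HttpRuntime.Cache[cacheKey] as List<string>;
        if (cached != null)
            return cached;

        // Run the search and collapse duplicates once, then keep the list
        // of stable document keys (e.g. your own primary keys).
        List<string> results = runSearch();
        HttpRuntime.Cache.Insert(cacheKey, results, null,
                                 DateTime.UtcNow.AddMinutes(10),
                                 Cache.NoSlidingExpiration);
        return results;
    }
}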

Kind Regards
Noel.
