Posted to java-user@lucene.apache.org by Daniel Rosher <ro...@googlemail.com> on 2007/03/16 18:15:10 UTC

Re: Announcement: Lucene powering Monster job search index (Beta)

Hi Peter,

Shouldn't the search also compute the Euclidean distance during filtering?
Otherwise, hits outside the range the user specified (perhaps highly relevant
ones) could be reported, particularly as the search radius gets larger.

Cheers,
Dan

On 1/28/07, Peter Keegan <pe...@gmail.com> wrote:
>
> Correction:
> We only do the Euclidean computation during sorting. For filtering, a simple
> bounding box is computed to approximate the radius, and 2 range comparisons
> are made to exclude documents. Because these comparisons are done outside of
> Lucene as integer comparisons, it is pretty fast. With 13000 results, the
> search time with distance sort is about 200 ms (compared to 30 ms for a
> simple non-radius, date-sorted keyword search).
>
> Peter
>
> On 1/27/07, no spam <mr...@gmail.com> wrote:
> >
> > Isn't it extremely inefficient to do the Euclidean distance twice?
> > Perhaps not a huge deal for a small search result set. I at times have
> > 13,000 results that match my search terms in an index with 1.2 million
> > docs.
> >
> > Can't you do some simple radian math first to rule out anything that is way
> > out of bounds, then do the Euclidean distance for the subset within bounds?
> > I'm currently only doing the distance calc once (post hit collector). I
> > don't have any performance numbers for the double vs single distance calc.
> >
> > I'm still working out the sort by radius myself.
> >
> > Mark
> >
> > On 11/3/06, Peter Keegan <pe...@gmail.com> wrote:
> > >
> > > Daniel,
> > > Yes, this is correct if you happen to be doing a radius search and sorting
> > > by mileage.
> > > Peter

Re: Announcement: Lucene powering Monster job search index (Beta)

Posted by Peter Keegan <pe...@gmail.com>.
Dan,

The filtering is done in the HitCollector by the bounding box, so the only
hits that get collected are those that match the keywords, the bounding box,
and some Lucene filters (BitSets) (I'm probably overloading the word
'filter' a bit). So, the only hits from the collector that need to be sorted
are those that are roughly within the search radius. When the search radius
gets larger, a new bounding box is computed for that query. Make sense?

Peter
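
The approach discussed in this thread (a cheap integer bounding-box test while
collecting hits, then a Euclidean distance computed only for sorting) can be
sketched roughly as follows. This is an illustration under stated assumptions,
not the Monster implementation: the class RadiusSearchSketch, the Hit type, the
micro-degree integer encoding, the 69 miles-per-degree constant, and the
exactRadius flag are all hypothetical and do not come from the thread or from
Lucene. Poles and the date line are ignored.

```java
import java.util.ArrayList;
import java.util.Comparator;
import java.util.List;

public class RadiusSearchSketch {
    // Assumption: coordinates are stored as integer micro-degrees.
    public static final int SCALE = 1_000_000;
    // Assumption: rough planar conversion, about 69 miles per degree of latitude.
    public static final double MILES_PER_DEGREE = 69.0;

    /** Hypothetical hit record; stands in for whatever the collector sees. */
    public static final class Hit {
        public final int docId, latE6, lonE6;
        public Hit(int docId, int latE6, int lonE6) {
            this.docId = docId; this.latE6 = latE6; this.lonE6 = lonE6;
        }
    }

    /** Box that encloses the search circle, kept as integers for cheap tests. */
    public static final class BoundingBox {
        final int minLat, maxLat, minLon, maxLon;
        public BoundingBox(double lat, double lon, double radiusMiles) {
            double dLat = radiusMiles / MILES_PER_DEGREE;
            // A degree of longitude shrinks with the cosine of the latitude.
            double dLon = radiusMiles / (MILES_PER_DEGREE * Math.cos(Math.toRadians(lat)));
            minLat = (int) Math.floor((lat - dLat) * SCALE);
            maxLat = (int) Math.ceil((lat + dLat) * SCALE);
            minLon = (int) Math.floor((lon - dLon) * SCALE);
            maxLon = (int) Math.ceil((lon + dLon) * SCALE);
        }
        /** Two range comparisons: one on latitude, one on longitude. */
        public boolean contains(Hit h) {
            return h.latE6 >= minLat && h.latE6 <= maxLat
                && h.lonE6 >= minLon && h.lonE6 <= maxLon;
        }
    }

    /** Planar (Euclidean) distance in miles from the search center to a hit. */
    public static double distanceMiles(double lat, double lon, Hit h) {
        double dy = (h.latE6 / (double) SCALE - lat) * MILES_PER_DEGREE;
        double dx = (h.lonE6 / (double) SCALE - lon)
                  * MILES_PER_DEGREE * Math.cos(Math.toRadians(lat));
        return Math.sqrt(dx * dx + dy * dy);
    }

    /**
     * Filters keyword matches by bounding box, then sorts by distance.
     * With exactRadius set, hits in the box corners (outside the circle)
     * are dropped as well; that extra check is an addition for illustration,
     * not something the thread says is done.
     */
    public static List<Hit> search(List<Hit> keywordMatches, double lat, double lon,
                                   double radiusMiles, boolean exactRadius) {
        BoundingBox box = new BoundingBox(lat, lon, radiusMiles);
        List<Hit> collected = new ArrayList<>();
        for (Hit h : keywordMatches) {
            if (!box.contains(h)) continue;  // cheap integer test first
            if (exactRadius && distanceMiles(lat, lon, h) > radiusMiles) continue;
            collected.add(h);
        }
        // The distance is computed only for hits that survived the filter.
        collected.sort(Comparator.comparingDouble(h -> distanceMiles(lat, lon, h)));
        return collected;
    }
}
```

With exactRadius set to false, a hit near a corner of the box can be collected
even though it lies beyond the requested radius (the corner of a square box is
about 1.41 times the radius from the center), which is the case Dan asks about.
Setting the flag to true costs one distance computation per hit inside the box
and removes those corner hits.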
