Posted to dev@nutch.apache.org by Jack Tang <hi...@gmail.com> on 2005/12/12 03:08:09 UTC

Hot Search! Re: Nutch Suggestion? (Google like "did you mean")

Hi

The approach works well for a single query field. How about multiple fields?
Say I want to make recommendations (or show hot searches) for the event
search engine http://betherebesquare.com/.

Any thoughts?

/Jack

On 9/29/05, Fredrik Andersson <fi...@gmail.com> wrote:
> Hi Jack!
>
>  I like these things to be driven by statistics rather than by the content
> of the index. If you run a search engine and want any kind of feedback, you
> will at least save all queries entered. You can store these in an index or
> database and run a Levenshtein metric against the potentially misspelled
> query. If my memory serves me right, a Lucene FuzzyQuery uses this metric,
> so a good approach would be to keep a Lucene index with |query,frequency|
> tuples (updated nightly, weekly, or whatever), search this index with a
> FuzzyQuery at some defined similarity, and pick the most frequent matching
> query as the suggestion.
>
>  Fredrik
>
> On 9/29/05, Jack Tang <hi...@gmail.com> wrote:
> > Hi
> >
> > I really like Google's "Did you mean" and I notice that nutch does
> > not currently provide this function.
> >
> > In this article http://today.java.net/lpt/a/211 , author Tim White
> > implemented suggestions using n-grams to generate a suggestion index. Do
> > you think that is a good fit for nutch? I mean, the index in nutch will be
> > really huge. Or should it just provide some dictionaries like jazzy(LGPL) does?
> >
> > Thanks
> > /Jack
> > --
> > Keep Discovering ... ...
> > http://www.jroller.com/page/jmars
> >
>
>


--
Keep Discovering ... ...
http://www.jroller.com/page/jmars
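
A minimal sketch of the query-log idea Fredrik describes in the quoted message above, for illustration only. It is plain Python rather than Lucene: the in-memory Counter stands in for the |query,frequency| index, and the hand-written levenshtein() and similarity() helpers stand in for whatever FuzzyQuery does internally. The function names, the 0.7 threshold and the sample queries are all assumptions, not anything from the thread.

```python
from collections import Counter


def levenshtein(a: str, b: str) -> int:
    """Edit distance counting single-character inserts, deletes and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # delete ca
                            curr[j - 1] + 1,      # insert cb
                            prev[j - 1] + cost))  # substitute (or keep)
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """Normalise the distance to a 0..1 score (1.0 means identical)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def suggest(query: str, query_log: Counter, min_similarity: float = 0.7):
    """Return the most frequent logged query close to `query`, or None.

    Only queries that were entered more often than `query` itself are
    considered, so a popular query is never "corrected" to a rarer one.
    """
    own_frequency = query_log.get(query, 0)
    candidates = [
        (frequency, logged)
        for logged, frequency in query_log.items()
        if logged != query
        and frequency > own_frequency
        and similarity(query, logged) >= min_similarity
    ]
    return max(candidates)[1] if candidates else None


# Hypothetical query log: query string -> number of times it was entered.
query_log = Counter({
    "britney spears": 120,
    "brittany spears": 3,
    "britney spear": 5,
    "nutch": 40,
})
```

With this log, suggest("britny spears", query_log) picks "britney spears", since all three similar entries pass the threshold and it is by far the most frequent. A linear scan like this only suits a small log; the point of keeping the tuples in a real index is to avoid comparing against every entry.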

Re: Hot Search! Re: Nutch Suggestion? (Google like "did you mean")

Posted by Fredrik Andersson <fi...@gmail.com>.
Re!

Well, you have two choices. Either you store ONE index with (what, when,
where, frequency) or THREE indices with (what, frequency), (when, frequency)
and (where, frequency). If you choose the first approach, you can simply
concatenate the (what, when, where) strings into one string, which lets your
application run a FuzzyQuery against just that one index. I have no idea
what area your application is in, but something like that would suit the
site mentioned previously, and it is a very speedy operation with just one
query.

If you use multiple indices, you can pinpoint which field is misspelled and
take other actions on that field (e.g. different searching techniques). It
should be more precise to fuzzy-match just the one misspelled field instead
of all three at once, which is implicitly what happens in the first approach.

Hope it helps,
Fredrik
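
To make the two layouts above concrete, here is a rough sketch, not Fredrik's code. It assumes in-memory Counters in place of Lucene indices and uses Python's difflib.get_close_matches as a stand-in for FuzzyQuery (difflib's ratio is not the Levenshtein metric). The "|" separator, the 0.7 cutoff, the function names and the sample events are made up for illustration.

```python
from collections import Counter
from difflib import get_close_matches


def best_match(text, index, cutoff=0.7):
    """Most frequent entry of `index` that is similar to `text`, or None."""
    close = get_close_matches(text, list(index), n=5, cutoff=cutoff)
    return max(close, key=lambda entry: index[entry]) if close else None


# Approach 1: ONE index keyed on the concatenated (what, when, where) string.
combined = Counter({
    "jazz concert|friday|boston": 12,
    "art fair|sunday|austin": 7,
})


def suggest_combined(what, when, where):
    """One fuzzy lookup over the whole query; all fields are fuzzed together."""
    return best_match("|".join((what, when, where)), combined)


# Approach 2: THREE indices, one (value, frequency) table per field.
per_field = {
    "what": Counter({"jazz concert": 30, "art fair": 9}),
    "when": Counter({"friday": 50, "sunday": 41}),
    "where": Counter({"boston": 22, "austin": 18}),
}


def suggest_per_field(what, when, where):
    """Return {field: suggestion} for only the fields that look misspelled."""
    fixes = {}
    for field, text in (("what", what), ("when", when), ("where", where)):
        if text in per_field[field]:
            continue  # seen verbatim before, so treat it as spelled correctly
        match = best_match(text, per_field[field])
        if match is not None:
            fixes[field] = match
    return fixes
```

For example, suggest_per_field("jazz concert", "friday", "bostn") reports only the "where" field, which is the pinpointing benefit of the three-index layout, while suggest_combined answers with a single lookup but cannot say which part was wrong.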


Re: Hot Search! Re: Nutch Suggestion? (Google like "did you mean")

Posted by Jack Tang <hi...@gmail.com>.
Hi Fredrik

Thanks for your reply:)
It is true that you can recommend the top-n most popular queries on
each indexed field. See the example:
http://www.business.com/index.asp?p=true (please select the "Job" tab).

However, I think betherebesquare.com is a bit different. My goal is to
recommend the combined <what><when><where> query, i.e. the multi-field
query. I think a recommendation of that kind is quite meaningful.

Is that possible in nutch? Or is there something I should refer to?

/Jack



Re: Hot Search! Re: Nutch Suggestion? (Google like "did you mean")

Posted by Fredrik Andersson <fi...@gmail.com>.
Hi again, Jack.

I don't see a problem with saving separate statistics for each field in your
query. In my applications, I pass the query string down to the statistics
index before it reaches QueryParser, i.e. I just save "foo bar", not
"field1:foo field1:bar field2:foo field2:bar". If you have something similar
to betherebesquare.com, it shouldn't be a problem to put the different fields
(name, date and location) into three statistical indices and do a
simultaneous (threaded) lookup on the three when a new query comes in, to
make suggestions.

Speaking from experience, you might want to separate the working copy and
the live copy of this statistical index, since you will want read-only
access to the live index, without writers occasionally locking it. During
each low-traffic period, copy the built-up statistical index, optimize() it,
and replace the current live index with the new copy.
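
The simultaneous lookup on three per-field indices could be sketched roughly as follows. This is an illustration under assumptions, not code from any of the applications mentioned: the indices are in-memory Counters, difflib.get_close_matches stands in for a FuzzyQuery, and the field names follow the (name, date, location) example above. The sample data and the 0.7 cutoff are invented.

```python
from collections import Counter
from concurrent.futures import ThreadPoolExecutor
from difflib import get_close_matches

# Hypothetical per-field statistics: value -> how often it was searched for.
INDICES = {
    "name": Counter({"jazz concert": 30, "art fair": 9}),
    "date": Counter({"friday": 50, "sunday": 41}),
    "location": Counter({"boston": 22, "austin": 18}),
}


def lookup(field, text, cutoff=0.7):
    """Fuzzy-search one field's index and return (field, most frequent match)."""
    index = INDICES[field]
    close = get_close_matches(text, list(index), n=5, cutoff=cutoff)
    best = max(close, key=index.__getitem__) if close else None
    return field, best


def suggest(query):
    """Look up every field of `query` in its own index, one thread per field."""
    with ThreadPoolExecutor(max_workers=len(INDICES)) as pool:
        futures = [pool.submit(lookup, field, text)
                   for field, text in query.items()]
        # result() blocks until that lookup is done, so the total wait is
        # roughly that of the slowest lookup when lookups are I/O-bound.
        return dict(future.result() for future in futures)
```

Calling suggest({"name": "jaz concert", "date": "fridy", "location": "boston"}) returns one suggestion per field; a field that was already spelled correctly simply comes back unchanged. Threads mainly pay off when each lookup waits on disk or network, as an index search would; for pure in-memory work like this toy data they add little.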

Good luck,
Fredrik
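
The working-copy/live-copy split can be sketched like this, assuming a single JSON file as the "live index" and an in-memory Counter as the working copy. The QueryStats class and its method names are hypothetical, and Lucene's optimize() step has no counterpart here; the only point shown is that readers see either the old snapshot or the new one, never a half-written file.

```python
import json
import os
import tempfile
from collections import Counter


class QueryStats:
    """Working copy takes the writes; the live file is only ever replaced whole."""

    def __init__(self, live_path):
        self.live_path = live_path
        self.working = Counter()  # updated on every incoming query

    def record(self, query):
        """Count one more occurrence of `query` in the working copy."""
        self.working[query] += 1

    def publish(self):
        """Snapshot the working copy and swap it in as the new live copy."""
        directory = os.path.dirname(os.path.abspath(self.live_path))
        # Write the snapshot next to the live file so the rename below
        # stays on the same filesystem.
        fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
        with os.fdopen(fd, "w", encoding="utf-8") as tmp:
            json.dump(self.working, tmp)
        # os.replace swaps the file in one step (atomic on POSIX).
        os.replace(tmp_path, self.live_path)

    def read_live(self):
        """Load the last published snapshot; unaffected by later record() calls."""
        with open(self.live_path, encoding="utf-8") as live:
            return Counter(json.load(live))
```

In use, record() would be called for each query as it arrives and publish() from a scheduled job during the low-traffic period; anything recorded after the last publish() stays invisible to read_live() until the next swap.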
