
Posted to general@lucene.apache.org by Dave Jones <dh...@comcast.net> on 2011/06/17 18:44:09 UTC

setup and use scenario

Dear Lucene Wizards,

I am trying to do the following:

Given:  all items are book titles
...
Little Book
The Little Book
The Little Book for Students
The Little Book for Students, Part II
The Little Book for Students at School
...
I have millions of books in this index.  However, I am scanning text that
has the book titles in the text line along with other information.  I want
to match exactly (highest Lucene score) the "documents" listed above as book
titles.

What is the best way to set up the indices, and what is the best way to
query?  What happens when the query is for "The Little Book of High School
Students" (title not in index)?

Your suggestions would be appreciated.  Also, if this type of question has
been asked before, please point me to the appropriate thread.  

Query:  "Little Book", "The Little Book for High School Students" (not in
index), etc.

Thanks for your help!

Cheers!

--
View this message in context: http://lucene.472066.n3.nabble.com/setup-and-use-scenario-tp3076997p3076997.html
Sent from the Lucene - General mailing list archive at Nabble.com.

Re: setup and use scenario

Posted by Dave Jones <dh...@comcast.net>.

Ryan, thanks for the reply.

I am using Lucene with mostly default settings.  What I am finding now is
that I need to boost certain terms; otherwise, I get the wrong results.  For
example, I would probably boost "Little" and "Book" to overcome the scoring
of the longer phrases that contain more terms.

The guidance that I am looking for is what is normally done in these
situations, rather than continuing with trial-and-error experiments.  For
example, is it better to boost in the index or boost in the query?  What is
a good boost value?  I started out at 3.0 and then tried 5.0 and got better
results, but picked up some small errors.  How does one find the minimum
score threshold for the cutoff when the book is not in the index?  Is
another type of analyzer better to use in this case?  Are there any other
settings that I should pay attention to?
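
On the cutoff question, one option that avoids picking a score threshold
altogether is to treat the search hits only as candidates and then verify
each one against the scanned line.  The sketch below is a swapped-in
technique, not a Lucene feature: it is plain Python with no Lucene calls,
and the names tokenize, contains_title, best_verified_title and CANDIDATES
are hypothetical.  It assumes that "match exactly" means the title's words
appear contiguously in the line after lowercasing and stripping punctuation.

```python
import re

# Hypothetical candidate list, e.g. the top hits returned by a search.
CANDIDATES = [
    "Little Book",
    "The Little Book",
    "The Little Book for Students",
    "The Little Book for Students, Part II",
    "The Little Book for Students at School",
]


def tokenize(text):
    """Lowercase and split on anything that is not a letter or digit."""
    return re.findall(r"[a-z0-9]+", text.lower())


def contains_title(line_tokens, title_tokens):
    """Return True if title_tokens occurs as a contiguous run in line_tokens."""
    n = len(title_tokens)
    if n == 0:
        return False
    return any(
        line_tokens[i:i + n] == title_tokens
        for i in range(len(line_tokens) - n + 1)
    )


def best_verified_title(line, candidates):
    """Return the longest candidate found verbatim in the line, or None."""
    line_tokens = tokenize(line)
    verified = [c for c in candidates if contains_title(line_tokens, tokenize(c))]
    # Prefer the title with the most tokens; None when nothing was verified.
    return max(verified, key=lambda c: len(tokenize(c)), default=None)


if __name__ == "__main__":
    line = "Ordered: The Little Book for Students, Part II (paperback), qty 3"
    print(best_verified_title(line, CANDIDATES))
```

A limitation of this check: for a line containing "The Little Book for High
School Students" it still returns "The Little Book", because that shorter
title really does occur in the text.  Telling the two apart needs extra
information about where the title ends in the line.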

Again, thanks for the help.

--
View this message in context: http://lucene.472066.n3.nabble.com/setup-and-use-scenario-tp3076997p3080719.html
Sent from the Lucene - General mailing list archive at Nabble.com.

Re: setup and use scenario

Posted by Ryan McKinley <ry...@gmail.com>.

Have you tried anything yet?  Solr?  Lucene?

With an OR query (the default) and standard analysis, it should just work.
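
To illustrate why an OR query can already rank the exact title first, here
is a toy scorer in plain Python.  It is only loosely modeled on classic
TF-IDF style scoring (a coordination factor for the share of query terms
matched, plus a length normalization that favors shorter titles); it is not
Lucene's actual implementation and makes no Lucene calls.  The names
TITLES, tokenize and rank are hypothetical, and the tokenizer keeps stop
words such as "the", which a real analyzer may remove.

```python
import math
import re
from collections import Counter

# The sample titles from the original question.
TITLES = [
    "Little Book",
    "The Little Book",
    "The Little Book for Students",
    "The Little Book for Students, Part II",
    "The Little Book for Students at School",
]


def tokenize(text):
    """Lowercase and split on non-alphanumerics (assumption: no stop words removed)."""
    return re.findall(r"[a-z0-9]+", text.lower())


def rank(query, titles):
    """Score every title that shares at least one term with the query.

    Returns (score, title) pairs, best first.
    """
    docs = [Counter(tokenize(t)) for t in titles]
    n_docs = len(docs)
    q_terms = sorted(set(tokenize(query)))
    results = []
    for title, doc in zip(titles, docs):
        matched = [t for t in q_terms if t in doc]
        if not matched:
            continue
        # Shorter titles get a larger weight per matching term.
        length_norm = 1.0 / math.sqrt(sum(doc.values()))
        total = 0.0
        for term in matched:
            doc_freq = sum(1 for d in docs if term in d)
            idf = 1.0 + math.log(n_docs / (doc_freq + 1))
            total += math.sqrt(doc[term]) * idf * idf * length_norm
        # Reward titles that match a larger share of the query terms.
        coord = len(matched) / len(q_terms)
        results.append((coord * total, title))
    results.sort(key=lambda pair: pair[0], reverse=True)
    return results


if __name__ == "__main__":
    for q in ["Little Book", "The Little Book for High School Students"]:
        print(q)
        for score, title in rank(q, TITLES):
            print(f"  {score:.3f}  {title}")
```

In this toy model the query "Little Book" ranks the title "Little Book"
first.  The query for the title that is not in the index still returns a top
hit ("The Little Book for Students at School"), so the best-scoring result
is not by itself proof of an exact match.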

