Posted to java-user@lucene.apache.org by Los Morales <mo...@hotmail.com> on 2006/10/02 20:08:40 UTC

lucene newbie question

Hi,

I'm new to Lucene and IR in general.  I'm a bit confused on the concept of 
fields.  From what I've read, a field does not have to be indexed but its 
value can be stored in an index.  Likewise a field can be indexed but its 
value is not stored in an index.  Now how can a field be searchable when its 
value is not stored in the index and vice-versa?  Again, I'm new to the 
Index/Search paradigm.  Thanks in advance.

-los

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: A question about query syntax, has it changed?

Posted by Doron Cohen <DO...@il.ibm.com>.
The problem stems from using the query parser to search a non-tokenized
field ("book"). The parser runs the phrase through the analyzer, which will
typically split it into the terms "first" and "title", whereas the
untokenized field was indexed as the single term "first title".

You can either create a term query for searching in that field, like this:
   new TermQuery(new Term("book","first title"));
or tokenize the field "book" and keep using QueryParser.

The decision depends on how you want to search the book field. If you always
search for the entire field value, TermQuery is the way to go. For searching
within the text of this field, i.e. when you know only part of the book
title, tokenizing the field would be the right way.

Bill Taylor <wa...@as-st.com> wrote on 02/10/2006 20:14:16:

> I am indexing individual pages of books.
>
> I get no results from the query
>
> accurate AND book:"first title"
>
> Each lucene document which represents one page of one book gets a field
> "book" which is indexed, stored, and not tokenized to store the title
> of the book.
>
> The word "accurate" appears on page one of the book "first title" as
> well as in some other books.  I can find "accurate" alone, in which
> case, it shows up from the other books as well.  But if I try to
> restrict the search to the book named "first title," I get no hits.
>
> Am I using the wrong query syntax?  I am using lucene 2.0 and the
> documentation says the query syntax is for 1.9.
>
> When I stepped through the query, the query parser created two things
> to search for, but did not find anything.
>
> It also did not find anything when I looked for
>
> book:"first title"
>
> or
>
> book:(+"first title")
>
> Luke says that there are 50 occurrences of "first title" in the book:
> field, which is the same number as there are pages in the document, so
> I suspect I am creating the index properly but not searching it
> properly.
>
> I got the expected number of responses to
>
> page:1
>
> but when I asked for page:[1 3]
>
> it appeared to find far too many pages.


A question about query syntax, has it changed?

Posted by Bill Taylor <wa...@as-st.com>.
I am indexing individual pages of books.

I get no results from the query

accurate AND book:"first title"

Each lucene document which represents one page of one book gets a field 
"book" which is indexed, stored, and not tokenized to store the title 
of the book.

The word "accurate" appears on page one of the book "first title" as 
well as in some other books.  I can find "accurate" alone, in which 
case, it shows up from the other books as well.  But if I try to 
restrict the search to the book named "first title," I get no hits.

Am I using the wrong query syntax?  I am using lucene 2.0 and the 
documentation says the query syntax is for 1.9.

When I stepped through the query, the query parser created two things 
to search for, but did not find anything.

It also did not find anything when I looked for

book:"first title"

or

book:(+"first title")

Luke says that there are 50 occurrences of "first title" in the book: 
field, which is the same number as there are pages in the document, so 
I suspect I am creating the index properly but not searching it 
properly.

I got the expected number of responses to

page:1

but when I asked for page:[1 3]

it appeared to find far too many pages.


Re: lucene newbie question

Posted by Erick Erickson <er...@gmail.com>.
Another Erick (note the correct spelling <G>). See below..

On 10/2/06, Los Morales <mo...@hotmail.com> wrote:
>
> Hi Erik,
>
> Thanks for the response.
>
> >Consider the index in the back of a book.  You could tear that out  and
> >still use it to tell what page something is on, but you have no  actual
> >content in hand.
> So, I guess what I'm having a hard time trying to figure out is, what's
> the point of having an index when you can't search/retrieve the contents
> of a field in the index since it is not stored?  Isn't the whole point of
> having an index to be able to search and retrieve the contents
> efficiently?


Your confusion here, I think, is that you CAN search on an unstored field.
Consider a book. I want to show the user the titles of the most-relevant
books. If I store the text of the entire book, it bloats the size of the
index markedly. So, I index the text but do NOT store it. Now I can show my
titles in relevancy order (when searched over the entire text), but don't
have to pay the penalty size-wise. What I can't do in this case is
reconstruct the book from the index because I didn't store the text. But I
can search it, which is what my app requires.


> Basically I'm not sure of the point of the UnIndexed and UnStored field
> types.  Say I use a field type "unindexed" for my SSN.  I know it's stored
> in the index but how am I supposed to retrieve it?


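You'd search on what you *have* indexed, get the doc (from the index), and
then read the field. Something like this, where searcher is your
IndexSearcher and query is built against an indexed field:

Hits hits = searcher.search(query);   // search what you indexed
String s = hits.doc(52).get("SSN");   // read the stored field of hit 52 (zero-based)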

I'm doing this now since we have images stored with internal IDs on a
separate file system. I *never* care to allow the user to search by our
internal ID number. So I index the caption, and STORE but do not INDEX the
internal ID. We provide a page full of links (in relevancy order) and when
the user clicks on one, use the stored internal ID to fetch the right image.
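
The two cases above (a caption that is indexed but not stored, an ID that is
stored but not indexed) can be pictured with a toy index in plain Java. This
is only a sketch: ToyIndex, its methods, and the sample caption and ID are
made up for illustration and are not the Lucene API. Searching consults the
term-to-document map; reading a field back consults a separate per-document
store.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Toy model, not Lucene: one structure for searching, another for retrieval.
class ToyIndex {
    // "field:term" -> ids of documents containing that term (searchable part)
    private final Map<String, TreeSet<Integer>> postings = new HashMap<>();
    // doc id -> (field -> original value) (retrievable part)
    private final Map<Integer, Map<String, String>> stored = new HashMap<>();

    void addField(int docId, String field, String value, boolean index, boolean store) {
        if (index) {
            for (String term : value.toLowerCase(Locale.ROOT).split("\\s+")) {
                postings.computeIfAbsent(field + ":" + term, k -> new TreeSet<>()).add(docId);
            }
        }
        if (store) {
            stored.computeIfAbsent(docId, k -> new HashMap<>()).put(field, value);
        }
    }

    // Ids of documents whose indexed field contains the term, ascending.
    List<Integer> search(String field, String term) {
        TreeSet<Integer> hits = postings.get(field + ":" + term.toLowerCase(Locale.ROOT));
        return hits == null ? new ArrayList<>() : new ArrayList<>(hits);
    }

    // Original value if the field was stored for this document, else null.
    String get(int docId, String field) {
        Map<String, String> fields = stored.get(docId);
        return fields == null ? null : fields.get(field);
    }

    public static void main(String[] args) {
        ToyIndex idx = new ToyIndex();
        idx.addField(0, "caption", "Sunset over the harbor", true, false); // indexed, not stored
        idx.addField(0, "imageId", "IMG-0042", false, true);               // stored, not indexed

        List<Integer> hits = idx.search("caption", "harbor");
        System.out.println(hits);                                // [0]
        System.out.println(idx.get(hits.get(0), "imageId"));     // IMG-0042
        System.out.println(idx.get(0, "caption"));               // null: never stored
        System.out.println(idx.search("imageId", "IMG-0042"));   // []: never indexed
    }
}
```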


> As for the unstored, it's like the scenario I described above... I see the
> fields in the index but I won't be able to search/retrieve it since I
> don't have the contents.  The "text" field type makes sense to me (with
> data being a String), as well as the type "keyword".
>
> Is there a scenario or scenarios you can describe where Unindexed/Unstored
> will be useful?  Thanks in advance!


Again, you can search unstored fields. You just can't reconstruct the input
with 100% fidelity (things like stop words will be missing, and any funky
games you played during indexing will mess up an attempt to reconstruct the
data).
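
Why the input cannot be rebuilt from an unstored field can be seen with a toy
analyzer in plain Java. The stop-word list and the lowercasing step here are
assumptions chosen for illustration, not the behavior of any particular
Lucene Analyzer.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Toy analyzer, for illustration only.
class LossyAnalysisDemo {
    // Hypothetical stop-word list.
    static final Set<String> STOP_WORDS =
            new HashSet<>(Arrays.asList("the", "of", "a", "and"));

    // Split on non-letters, lowercase, and drop stop words.
    static List<String> analyze(String text) {
        List<String> terms = new ArrayList<>();
        for (String token : text.split("[^\\p{L}]+")) {
            String term = token.toLowerCase(Locale.ROOT);
            if (!term.isEmpty() && !STOP_WORDS.contains(term)) {
                terms.add(term);
            }
        }
        return terms;
    }

    public static void main(String[] args) {
        String original = "The Art of War, and a Commentary";
        // Case, punctuation, and stop words are gone from the indexed terms.
        System.out.println(analyze(original)); // [art, war, commentary]
        // A different input produces the same terms, so the terms alone
        // cannot tell you which text was indexed.
        System.out.println(analyze("art war commentary").equals(analyze(original))); // true
    }
}
```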

Hope this helps.
Erick



Re: lucene newbie question

Posted by Doron Cohen <DO...@il.ibm.com>.
The SSN case is actually a common situation.

Assume you have a (relational) database with a table of products with three
columns:
- SSN, which is also the primary key for that table,
- DESCRIPTION, which has free text (i.e. unformatted text) describing the
product,
- OTHER - additional info.
Also assume you want to allow users of your application to search a product
by its description. For each product found, you intend to fetch the data on
that product from the database and display it to the users.

This can be done in the following setup:
Create a Lucene index with two fields:
- ssn - stored, but not indexed
- description - tokenized (hence indexed) but not stored.

Now the application would send the user query to Lucene, using the
description field. For each document found, the application would fetch its
ssn (which is available from the Lucene index since it was stored). Using
this ssn, the application would fetch all sorts of data on that product and
display it to the user.
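
The search-then-fetch flow can be sketched in plain Java, with maps standing
in for both the database and the Lucene index. Everything here
(SearchThenFetchDemo, its methods, the sample products and keys) is
hypothetical and only mirrors the shape of the setup described above.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

// Toy stand-ins: one map plays the database, two maps play the search index.
class SearchThenFetchDemo {
    // "Database": primary key (the SSN column) -> full row.
    private final Map<String, String> database = new HashMap<>();
    // "Index": description term -> doc ids (indexed, not stored).
    private final Map<String, TreeSet<Integer>> descriptionTerms = new HashMap<>();
    // "Index": doc id -> ssn (stored, not indexed).
    private final Map<Integer, String> storedSsn = new HashMap<>();

    void addProduct(String ssn, String description, String other) {
        database.put(ssn, ssn + " | " + description + " | " + other);
        int docId = storedSsn.size();
        storedSsn.put(docId, ssn);
        for (String term : description.toLowerCase(Locale.ROOT).split("\\s+")) {
            descriptionTerms.computeIfAbsent(term, k -> new TreeSet<>()).add(docId);
        }
    }

    // Search by description term, then fetch each hit's row by its stored key.
    List<String> find(String term) {
        List<String> rows = new ArrayList<>();
        TreeSet<Integer> hits = descriptionTerms.get(term.toLowerCase(Locale.ROOT));
        if (hits != null) {
            for (int docId : hits) {
                rows.add(database.get(storedSsn.get(docId)));
            }
        }
        return rows;
    }

    public static void main(String[] args) {
        SearchThenFetchDemo demo = new SearchThenFetchDemo();
        demo.addProduct("P-1", "red cotton shirt", "size M");
        demo.addProduct("P-2", "blue wool scarf", "one size");
        System.out.println(demo.find("wool")); // [P-2 | blue wool scarf | one size]
    }
}
```

The index never holds the description text or the OTHER column; it only needs
enough to find the document and hand back the key.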

There are other possible designs of course - you may want to have
additional data in the Lucene index - but this hopefully gives a feel for
how fields with different settings are used in an application.

I think you would find LIA ("Lucene In Action" book) very useful.

"Los Morales" <mo...@hotmail.com> wrote on 02/10/2006 11:46:45:

> Hi Erik,
>
> Thanks for the response.
>
> >Consider the index in the back of a book.  You could tear that out  and
> >still use it to tell what page something is on, but you have no  actual
> >content in hand.
> So, I guess what I'm having a hard time trying to figure out is, what's
> the point of having an index when you can't search/retrieve the contents
> of a field in the index since it is not stored?  Isn't the whole point of
> having an index to be able to search and retrieve the contents
> efficiently?
>
> Basically I'm not sure of the point of the UnIndexed and UnStored field
> types.  Say I use a field type "unindexed" for my SSN.  I know it's stored
> in the index but how am I supposed to retrieve it?
> As for the unstored, it's like the scenario I described above... I see the
> fields in the index but I won't be able to search/retrieve it since I
> don't have the contents.  The "text" field type makes sense to me (with
> data being a String), as well as the type "keyword".
>
> Is there a scenario or scenarios you can describe where Unindexed/Unstored
> will be useful?  Thanks in advance!
>
> -los


Re: lucene newbie question

Posted by Los Morales <mo...@hotmail.com>.
Hi Erik,

Thanks for the response.

>Consider the index in the back of a book.  You could tear that out  and 
>still use it to tell what page something is on, but you have no  actual 
>content in hand.
So, I guess what I'm having a hard time trying to figure out is, what's the 
point of having an index when you can't search/retrieve the contents of a 
field in the index since it is not stored?  Isn't the whole point of having 
an index to be able to search and retrieve the contents efficiently?

Basically I'm not sure of the point of the UnIndexed and UnStored field types.  
Say I use a field type "unindexed" for my SSN.  I know it's stored in the 
index but how am I supposed to retrieve it?
As for the unstored, it's like the scenario I described above... I see the 
fields in the index but I won't be able to search/retrieve it since I don't 
have the contents.  The "text" field type makes sense to me (with data being 
a String), as well as the type "keyword".

Is there a scenario or scenarios you can describe where Unindexed/Unstored 
will be useful?  Thanks in advance!

-los


>From: Erik Hatcher <er...@ehatchersolutions.com>
>Reply-To: java-user@lucene.apache.org
>To: java-user@lucene.apache.org
>Subject: Re: lucene newbie question
>Date: Mon, 2 Oct 2006 14:12:25 -0400
>
>
>On Oct 2, 2006, at 2:08 PM, Los Morales wrote:
>>I'm new to Lucene and IR in general.  I'm a bit confused on the  concept 
>>of fields.  From what I've read, a field does not have to  be indexed but 
>>its value can be stored in an index.  Likewise a  field can be indexed but 
>>its value is not stored in an index.  Now  how can a field be searchable 
>>when its value is not stored in the  index and vice-versa?  Again, I'm new 
>>to the Index/Search  paradigm.  Thanks in advance.
>
>Consider the index in the back of a book.  You could tear that out  and 
>still use it to tell what page something is on, but you have no  actual 
>content in hand.  When a field is tokenized (and therefore  implicitly 
>indexed), it is run through the specified Analyzer and the  terms emitted 
>are indexed, but the original text may or may not also  be stored in the 
>index.
>
>Make sense?
>
>	Erik


Re: lucene newbie question

Posted by Erik Hatcher <er...@ehatchersolutions.com>.
On Oct 2, 2006, at 2:08 PM, Los Morales wrote:
> I'm new to Lucene and IR in general.  I'm a bit confused on the  
> concept of fields.  From what I've read, a field does not have to  
> be indexed but its value can be stored in an index.  Likewise a  
> field can be indexed but its value is not stored in an index.  Now  
> how can a field be searchable when its value is not stored in the  
> index and vice-versa?  Again, I'm new to the Index/Search  
> paradigm.  Thanks in advance.

Consider the index in the back of a book.  You could tear that out  
and still use it to tell what page something is on, but you have no  
actual content in hand.  When a field is tokenized (and therefore  
implicitly indexed), it is run through the specified Analyzer and the  
terms emitted are indexed, but the original text may or may not also  
be stored in the index.

Make sense?

	Erik

