You are viewing a plain text version of this content. The canonical link for it is here.
Posted to java-user@lucene.apache.org by "Kainth, Sachin" <Sa...@atkinsglobal.com> on 2007/02/19 16:11:15 UTC

Fields

Hi all,

I have a few question regarding indexing documents.

1. With my experience of indexing documents with lucene so far I have
done things like:

Doc.Add(Field.Text("album", Album));

Where Album is a string representing an album name.  Now with this sort
of indexing what I do is a search such as:

"album:Thriller"

a) Does this mean that I cannot do an search across all fields  by
submitting the query:

"Thriller"?  In other words by submitting this query would my code
search all fields?

b) Is there a way in which I can index elements of a document without
naming the field.  What would the impact of such a use of the indexing
capabilities of Lucene be?

2. Is there a limit to the number of
a) named fields per document that I can store
b) non-named fields per document that I can store

3. 

a) Is it possible in Lucene to conduct searches that are very complex
such as:

((album = Thriller AND artist = (Michael OR Jackson)) OR (date between X
AND Y)) AND (label = sony OR Epic)   etc...

b) For such a query what are the performance penalties compared to a
simple search involving 1 term?

Cheers

Sachin



This email and any attached files are confidential and copyright protected. If you are not the addressee, any dissemination of this communication is strictly prohibited. Unless otherwise expressly agreed in writing, nothing stated in this communication shall be legally binding.

The ultimate parent company of the Atkins Group is WS Atkins plc.  Registered in England No. 1885586.  Registered Office Woodcote Grove, Ashley Road, Epsom, Surrey KT18 5BW.

Consider the environment. Please don't print this e-mail unless you really need to. 

Re: Fields

Posted by Erick Erickson <er...@gmail.com>.
It's whatever you set it to. From the API...

QueryParser

public *QueryParser*(String
<http://java.sun.com/j2se/1.4/docs/api/java/lang/String.html> f,
                   Analyzer
<file:///C:/lucene-2.1.0/docs/api/org/apache/lucene/analysis/Analyzer.html>
a)

Constructs a query parser.

*Parameters:*f - the default field for query terms.a - used to find terms in
the query text.

Note: this is the 2.1 API (idential to 2.0 as I remember).


On 2/19/07, Kainth, Sachin <Sa...@atkinsglobal.com> wrote:
>
> Hi Erik,
>
> I looked at the QueryParser API doc but I can't seem to find what the
> default field is. Also, how would the syntax of the index code differ
> when indexing a word to the default field from this:
>
> Doc.Add(Field.Text("album", Album));
>
> Cheers
>
> Sachin
>
> -----Original Message-----
> From: Erick Erickson [mailto:erickerickson@gmail.com]
> Sent: 19 February 2007 16:05
> To: java-user@lucene.apache.org
> Subject: Re: Fields
>
> See below.
>
> On 2/19/07, Kainth, Sachin <Sa...@atkinsglobal.com> wrote:
> >
> > Hi all,
> >
> > I have a few question regarding indexing documents.
> >
> > 1. With my experience of indexing documents with lucene so far I have
> > done things like:
> >
> > Doc.Add(Field.Text("album", Album));
> >
> > Where Album is a string representing an album name.  Now with this
> > sort of indexing what I do is a search such as:
> >
> > "album:Thriller"
> >
> > a) Does this mean that I cannot do an search across all fields  by
> > submitting the query:
> >
> > "Thriller"?  In other words by submitting this query would my code
> > search all fields?
>
>
> No. If you just submit "Thriller", you'll only search the default field.
> See QueryParser for the default field.
>
>
> b) Is there a way in which I can index elements of a document without
> > naming the field.  What would the impact of such a use of the indexing
>
> > capabilities of Lucene be?
>
>
> I don't think this makes sense in Lucene terms. All elements in a
> document have a field. You can index everything into one field if you
> need an aggregate, which gives you this same result.
>
> Do note, however, that there's no requirement that all documents have
> the same fields.
>
>
> 2. Is there a limit to the number of
> > a) named fields per document that I can store
>
>
> I think there is, but it's absurdly high. Don't worry about this....
>
>
> b) non-named fields per document that I can store
>
>
> 0 since I don't think you can.
>
>
> 3.
> >
> > a) Is it possible in Lucene to conduct searches that are very complex
> > such as:
> >
> > ((album = Thriller AND artist = (Michael OR Jackson)) OR (date between
> X
> > AND Y)) AND (label = sony OR Epic)   etc...
>
>
> Yes
>
>
> b) For such a query what are the performance penalties compared to a
> > simple search involving 1 term?
>
>
> In the immortal words of Mr. Hatcher.. .it depends. You'll really just
> have to experiment and find out. It can probably be approximated by
> taking the sum of the individual queries as the upper limit. The real
> killer is wildcards..... The real question isn't "what is the effect on
> performance", it's "is the performance good enough for my application".
> Which varies as the characteristics of the database change.
>
> I would argue that a 1M index will process arbitrarily complex queries
> "fast enough". The same may not be true for a 100G index. So this
> question is really unanswerable in the abstract.
>
>
> Cheers
> >
> > Sachin
> >
> >
> >
> > This email and any attached files are confidential and copyright
> > protected. If you are not the addressee, any dissemination of this
> > communication is strictly prohibited. Unless otherwise expressly
> > agreed in writing, nothing stated in this communication shall be
> legally binding.
> >
> > The ultimate parent company of the Atkins Group is WS Atkins plc.
> > Registered in England No. 1885586.  Registered Office Woodcote Grove,
> > Ashley Road, Epsom, Surrey KT18 5BW.
> >
> > Consider the environment. Please don't print this e-mail unless you
> > really need to.
> >
>
>
> This message has been scanned for viruses by MailControl - (see
> http://bluepages.wsatkins.co.uk/?6875772)
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

RE: Fields

Posted by "Kainth, Sachin" <Sa...@atkinsglobal.com>.
Hi Erik,

I looked at the QueryParser API doc but I can't seem to find what the
default field is. Also, how would the syntax of the index code differ
when indexing a word to the default field from this:

Doc.Add(Field.Text("album", Album));

Cheers

Sachin

-----Original Message-----
From: Erick Erickson [mailto:erickerickson@gmail.com] 
Sent: 19 February 2007 16:05
To: java-user@lucene.apache.org
Subject: Re: Fields

See below.

On 2/19/07, Kainth, Sachin <Sa...@atkinsglobal.com> wrote:
>
> Hi all,
>
> I have a few question regarding indexing documents.
>
> 1. With my experience of indexing documents with lucene so far I have 
> done things like:
>
> Doc.Add(Field.Text("album", Album));
>
> Where Album is a string representing an album name.  Now with this 
> sort of indexing what I do is a search such as:
>
> "album:Thriller"
>
> a) Does this mean that I cannot do an search across all fields  by 
> submitting the query:
>
> "Thriller"?  In other words by submitting this query would my code 
> search all fields?


No. If you just submit "Thriller", you'll only search the default field.
See QueryParser for the default field.


b) Is there a way in which I can index elements of a document without
> naming the field.  What would the impact of such a use of the indexing

> capabilities of Lucene be?


I don't think this makes sense in Lucene terms. All elements in a
document have a field. You can index everything into one field if you
need an aggregate, which gives you this same result.

Do note, however, that there's no requirement that all documents have
the same fields.


2. Is there a limit to the number of
> a) named fields per document that I can store


I think there is, but it's absurdly high. Don't worry about this....


b) non-named fields per document that I can store


0 since I don't think you can.


3.
>
> a) Is it possible in Lucene to conduct searches that are very complex 
> such as:
>
> ((album = Thriller AND artist = (Michael OR Jackson)) OR (date between
X
> AND Y)) AND (label = sony OR Epic)   etc...


Yes


b) For such a query what are the performance penalties compared to a
> simple search involving 1 term?


In the immortal words of Mr. Hatcher.. .it depends. You'll really just
have to experiment and find out. It can probably be approximated by
taking the sum of the individual queries as the upper limit. The real
killer is wildcards..... The real question isn't "what is the effect on
performance", it's "is the performance good enough for my application".
Which varies as the characteristics of the database change.

I would argue that a 1M index will process arbitrarily complex queries
"fast enough". The same may not be true for a 100G index. So this
question is really unanswerable in the abstract.


Cheers
>
> Sachin
>
>
>
> This email and any attached files are confidential and copyright 
> protected. If you are not the addressee, any dissemination of this 
> communication is strictly prohibited. Unless otherwise expressly 
> agreed in writing, nothing stated in this communication shall be
legally binding.
>
> The ultimate parent company of the Atkins Group is WS Atkins plc.  
> Registered in England No. 1885586.  Registered Office Woodcote Grove, 
> Ashley Road, Epsom, Surrey KT18 5BW.
>
> Consider the environment. Please don't print this e-mail unless you 
> really need to.
>


This message has been scanned for viruses by MailControl - (see
http://bluepages.wsatkins.co.uk/?6875772)

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org


Re: Fields

Posted by Erick Erickson <er...@gmail.com>.
See below.

On 2/19/07, Kainth, Sachin <Sa...@atkinsglobal.com> wrote:
>
> Hi all,
>
> I have a few question regarding indexing documents.
>
> 1. With my experience of indexing documents with lucene so far I have
> done things like:
>
> Doc.Add(Field.Text("album", Album));
>
> Where Album is a string representing an album name.  Now with this sort
> of indexing what I do is a search such as:
>
> "album:Thriller"
>
> a) Does this mean that I cannot do an search across all fields  by
> submitting the query:
>
> "Thriller"?  In other words by submitting this query would my code
> search all fields?


No. If you just submit "Thriller", you'll only search the default field. See
QueryParser for the default field.


b) Is there a way in which I can index elements of a document without
> naming the field.  What would the impact of such a use of the indexing
> capabilities of Lucene be?


I don't think this makes sense in Lucene terms. All elements in a document
have a field. You can index everything into one field if you need an
aggregate, which gives you this same result.

Do note, however, that there's no requirement that all documents have the
same fields.


2. Is there a limit to the number of
> a) named fields per document that I can store


I think there is, but it's absurdly high. Don't worry about this....


b) non-named fields per document that I can store


0 since I don't think you can.


3.
>
> a) Is it possible in Lucene to conduct searches that are very complex
> such as:
>
> ((album = Thriller AND artist = (Michael OR Jackson)) OR (date between X
> AND Y)) AND (label = sony OR Epic)   etc...


Yes


b) For such a query what are the performance penalties compared to a
> simple search involving 1 term?


In the immortal words of Mr. Hatcher.. .it depends. You'll really just have
to experiment and find out. It can probably be approximated by taking the
sum of the individual queries as the upper limit. The real killer is
wildcards..... The real question isn't "what is the effect on performance",
it's "is the performance good enough for my application". Which varies as
the characteristics of the database change.

I would argue that a 1M index will process arbitrarily complex queries "fast
enough". The same may not be true for a 100G index. So this question is
really unanswerable in the abstract.


Cheers
>
> Sachin
>
>
>
> This email and any attached files are confidential and copyright
> protected. If you are not the addressee, any dissemination of this
> communication is strictly prohibited. Unless otherwise expressly agreed in
> writing, nothing stated in this communication shall be legally binding.
>
> The ultimate parent company of the Atkins Group is WS Atkins
> plc.  Registered in England No. 1885586.  Registered Office Woodcote Grove,
> Ashley Road, Epsom, Surrey KT18 5BW.
>
> Consider the environment. Please don't print this e-mail unless you really
> need to.
>