Posted to java-user@lucene.apache.org by Anshul jain <an...@epfl.ch> on 2009/01/21 14:39:39 UTC

Lucene Performance issue

Hi,

I've indexed around half a million XML documents. Here is a sample
document:

<a:attribute>
  <a:name>cogito:Name</a:name>
  <a:value>Alexander the Great</a:value>
</a:attribute>

<a:attribute>
  <a:name>cogito:domain</a:name>
  <a:value>ancient history</a:value>
</a:attribute>

<a:attribute>
  <a:name>cogito:first_sentence</a:name>
  <a:value>
Alexander the Great (Greek: or Megas Alexandros; July 20 356 BC June 10 323
BC), also known as Alexander III, was an ancient Greek king (basileus) of
Macedon (336-323 BC).
  </a:value>
</a:attribute>


The average document size is around 4 KB.

There are a few performance issues I need help with. When I index documents,
in a structured manner, using field information like:
name: alexander the great
domain: ancient history
first_sentence: Alexander the Great (Greek: or Megas Alexandros; July 20 356
BC June 10 323 BC), also known as Alexander III, was an ancient Greek king
(basileus) of Macedon (336-323 BC).
bagOfWords: alexander the great ancient history Alexander the Great (Greek:
or Megas Alexandros; July 20 356 BC June 10 323 BC), also known as Alexander
III, was an ancient Greek king (basileus) of Macedon (336-323 BC).

bagOfWords is the field with all the text appended to it.

I get an index size of 4.5 GB, but if I just append all the text and store
it in one field, like:
value: alexander the great ancient history Alexander the Great (Greek: or
Megas Alexandros; July 20 356 BC June 10 323 BC), also known as Alexander
III, was an ancient Greek king (basileus) of Macedon (336-323 BC).

the index size is only 700 MB. Why is this happening?



Also, the query execution time of MultiFieldQueries is very slow: it is 20
times slower than a single-field query. Is that normal? What could be the
reason for it?

Thanks,
Cheers,
Anshul

-- 
Anshul Jain

Re: Lucene Performance issue

Posted by Erick Erickson <er...@gmail.com>.
I agree with Ian that these times sound way too high. I'd
also ask whether you fire a few warmup searches at your
server before measuring the increased time. Otherwise you
might just be seeing the cache being populated.
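
To make the warmup point concrete, here is a minimal timing sketch in plain
Java. It is not Lucene code: the class name WarmupTimer and the Runnable
standing in for a search call are assumptions for illustration. The idea is
only that the first few calls run untimed, so cache population does not
leak into the measurement.

```java
/** Toy timing harness: run a few untimed warm-up calls, then average the timed calls. */
final class WarmupTimer {

    /**
     * Runs search warmupRuns times without timing, then timedRuns times with timing.
     * Returns the mean wall-clock time per timed run, in nanoseconds.
     */
    static long meanNanos(Runnable search, int warmupRuns, int timedRuns) {
        if (timedRuns <= 0) {
            throw new IllegalArgumentException("timedRuns must be positive");
        }
        for (int i = 0; i < warmupRuns; i++) {
            search.run();               // untimed: lets caches fill before measuring
        }
        long total = 0;
        for (int i = 0; i < timedRuns; i++) {
            long start = System.nanoTime();
            search.run();
            total += System.nanoTime() - start;
        }
        return total / timedRuns;
    }

    public static void main(String[] args) {
        // Stand-in for a real search call; swap in the single-field or multi-field query.
        Runnable fakeSearch = () -> "alexander the great".split(" ");
        System.out.println("mean ns per query: " + meanNanos(fakeSearch, 5, 20));
    }
}
```

Timing both query types through the same harness, with the same number of
warmup runs, would make the "20 times slower" figure easier to trust.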

Best
Erick

On Wed, Jan 21, 2009 at 10:42 AM, Ian Lea <ia...@gmail.com> wrote:

> Hi
>
>
> Space: 700 MB vs 4.5 GB sounds way too big a difference.  Are you sure
> you aren't loading multiple copies of the data or something like that?
>
> Queries: a 20 times slowdown for a multi field query also sounds way
> too big.  What do the simple and multi field queries look like?
>
>
>
> --
> Ian.
>
> ---------------------------------------------------------------------
> To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
> For additional commands, e-mail: java-user-help@lucene.apache.org
>
>

Re: Lucene Performance issue

Posted by Erick Erickson <er...@gmail.com>.
Note that your two queries are different unless you've
changed the default operator.

Also, your bagOfWords query is searching across your
default field for the last two terms.

Your bagOfWords query is really something like

bagOfWords:Alexander OR <default field>:history OR <default field>:Macedon.
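
The expansion above can be mimicked with a toy sketch in plain Java. This
is not Lucene's QueryParser; the class DefaultFieldDemo and the default
field name "contents" are made up for illustration, and boolean operators
are ignored. It only shows that a field prefix binds to the one term it is
attached to, while bare terms fall back to the default field.

```java
import java.util.ArrayList;
import java.util.List;

/** Toy illustration only: a field prefix applies to a single term, not to the rest of the query. */
final class DefaultFieldDemo {

    /** Turns each whitespace-separated token into "field:term", using defaultField for bare terms. */
    static List<String> expand(String query, String defaultField) {
        List<String> clauses = new ArrayList<>();
        for (String token : query.trim().split("\\s+")) {
            if (token.isEmpty()) {
                continue;                                   // empty query string
            }
            int colon = token.indexOf(':');
            if (colon > 0 && colon < token.length() - 1) {
                clauses.add(token);                         // explicit field:term, kept as written
            } else {
                clauses.add(defaultField + ":" + token);    // bare term goes to the default field
            }
        }
        return clauses;
    }

    public static void main(String[] args) {
        // "contents" is a hypothetical default field name.
        List<String> clauses = expand("bagOfWords:Alexander history Macedon", "contents");
        System.out.println(String.join(" OR ", clauses));
        // prints: bagOfWords:Alexander OR contents:history OR contents:Macedon
    }
}
```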

Best
Erick

On Wed, Jan 21, 2009 at 11:10 AM, Anshul jain <an...@gmail.com> wrote:

> Hi,
>
> thanks for the reply.
>
> For the document in my last mail:
> multifieldQuery:
> name: Alexander AND domain: history AND first_sentence: Macedon
>
> Single field query:
> bagOfWords: Alexander history Macedon
>
> I can say for sure that multiple copies are not indexed. But the text is
> divided across many fields. Can that be a reason?
>
> How is field data stored in Index and searched? I read the document on file
> formats on lucene's web site but it was not very clear.
>
> Cheers,
> Anshul
> --
> Anshul Jain
>

Re: Lucene Performance issue

Posted by Anshul jain <an...@gmail.com>.
@Erick: Yes, I changed the default field; it is "bagofwords" now.

@Ian: Yes, both indexes were optimized, and I didn't do any deletions.
I'm using version 2.4.0.

I'll repeat the experiment, just to be sure.
Meanwhile, do you have any documentation on Lucene fields? What I need to
know is how Lucene stores field information in the posting lists and how
it is used while searching.
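
As a rough mental model for this question (a toy sketch in plain Java, not
Lucene's actual file format): Lucene identifies a term by its field
together with its text, so the same word indexed in two fields gets two
separate posting lists. The class ToyFieldIndex and its methods below are
invented for illustration.

```java
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeMap;
import java.util.TreeSet;

/** Toy inverted index whose dictionary is keyed by "field:term". Illustration only. */
final class ToyFieldIndex {
    // "field:term" -> sorted ids of the documents containing that term in that field
    private final Map<String, SortedSet<Integer>> postings = new TreeMap<>();

    /** Splits text on non-word characters, lowercases it, and records docId per (field, term). */
    void add(int docId, String field, String text) {
        for (String term : text.toLowerCase(Locale.ROOT).split("\\W+")) {
            if (term.isEmpty()) {
                continue;
            }
            postings.computeIfAbsent(field + ":" + term, k -> new TreeSet<>()).add(docId);
        }
    }

    /** A lookup always names a field; the same word in another field is a different entry. */
    SortedSet<Integer> docs(String field, String term) {
        return postings.getOrDefault(field + ":" + term.toLowerCase(Locale.ROOT), new TreeSet<>());
    }

    int distinctTerms() {
        return postings.size();
    }

    public static void main(String[] args) {
        ToyFieldIndex structured = new ToyFieldIndex();
        structured.add(0, "name", "Alexander the Great");
        structured.add(0, "domain", "ancient history");
        structured.add(0, "bagOfWords", "Alexander the Great ancient history");

        ToyFieldIndex flat = new ToyFieldIndex();
        flat.add(0, "value", "Alexander the Great ancient history");

        System.out.println(structured.distinctTerms()); // prints 10
        System.out.println(flat.distinctTerms());       // prints 5
    }
}
```

In this toy, the structured layout holds twice as many entries as the flat
one simply because every word is indexed once in its own field and once in
bagOfWords. That says nothing about what accounts for the rest of the
difference in the real index sizes.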

Thanks,
Cheers,
Anshul

On Wed, Jan 21, 2009 at 6:05 PM, Ian Lea <ia...@gmail.com> wrote:

> > ...
> > I can say for sure that multiple copies are not indexed. But the text is
> > divided across many fields. Can that be a reason?
>
> Not for that amount of difference.  You may be sure that you are not
> indexing multiple copies, but I'm not.  Convince me - create 2 new
> indexes via the 2 methods, from scratch, and count the number of docs.
>  And verify the size of the indexes.  Does the multi GB one contain
> deleted docs?  Has it been optimized?
>
> > How is field data stored in Index and searched? I read the document on
> file
> > formats on lucene's web site but it was not very clear.
>
> All I need to know is that searching is extremely fast.  Have you
> taken note of Erick's suggestions and comments?
>
> What version of lucene are you using?
>
>
>
> --
> Ian.


-- 
Anshul Jain

Re: Lucene Performance issue

Posted by Ian Lea <ia...@gmail.com>.
> ...
> I can say for sure that multiple copies are not indexed. But the text is
> divided across many fields. Can that be a reason?

Not for that amount of difference.  You may be sure that you are not
indexing multiple copies, but I'm not.  Convince me - create 2 new
indexes via the 2 methods, from scratch, and count the number of docs.
 And verify the size of the indexes.  Does the multi GB one contain
deleted docs?  Has it been optimized?
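
For the size half of that check, a small plain-Java sketch along these
lines would do. The class name IndexDirSize is made up, and it only sums
file sizes under a directory. The document counts themselves would have to
come from Lucene, for example via IndexReader.numDocs().

```java
import java.io.IOException;
import java.io.UncheckedIOException;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.stream.Stream;

/** Sums the sizes of all regular files under a directory, e.g. an index directory. */
final class IndexDirSize {

    static long totalBytes(Path dir) throws IOException {
        try (Stream<Path> files = Files.walk(dir)) {
            return files.filter(Files::isRegularFile)
                        .mapToLong(p -> {
                            try {
                                return Files.size(p);
                            } catch (IOException e) {
                                throw new UncheckedIOException(e);
                            }
                        })
                        .sum();
        }
    }

    public static void main(String[] args) throws IOException {
        if (args.length == 0) {
            System.out.println("usage: java IndexDirSize <indexDir> [<indexDir> ...]");
            return;
        }
        // Pass the paths of the two freshly built indexes to compare them side by side.
        for (String arg : args) {
            System.out.println(arg + ": " + totalBytes(Paths.get(arg)) + " bytes");
        }
    }
}
```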

> How is field data stored in Index and searched? I read the document on file
> formats on lucene's web site but it was not very clear.

All I need to know is that searching is extremely fast.  Have you
taken note of Erick's suggestions and comments?

What version of lucene are you using?



--
Ian.


Re: Lucene Performance issue

Posted by Anshul jain <an...@gmail.com>.
Hi,

thanks for the reply.

For the document in my last mail:
multifieldQuery:
name: Alexander AND domain: history AND first_sentence: Macedon

Single field query:
bagOfWords: Alexander history Macedon

I can say for sure that multiple copies are not indexed. But the text is
divided across many fields. Can that be a reason?

How is field data stored in Index and searched? I read the document on file
formats on lucene's web site but it was not very clear.

Cheers,
Anshul

On Wed, Jan 21, 2009 at 4:42 PM, Ian Lea <ia...@gmail.com> wrote:

> Hi
>
>
> Space: 700 MB vs 4.5 GB sounds way too big a difference.  Are you sure
> you aren't loading multiple copies of the data or something like that?
>
> Queries: a 20 times slowdown for a multi field query also sounds way
> too big.  What do the simple and multi field queries look like?
>
>
>
> --
> Ian.
>


-- 
Anshul Jain

Re: Lucene Performance issue

Posted by Ian Lea <ia...@gmail.com>.
Hi


Space: 700 MB vs 4.5 GB sounds way too big a difference.  Are you sure
you aren't loading multiple copies of the data or something like that?

Queries: a 20 times slowdown for a multi field query also sounds way
too big.  What do the simple and multi field queries look like?



--
Ian.


