Posted to dev@lucene.apache.org by Yonik Seeley <ys...@gmail.com> on 2005/04/04 23:35:04 UTC

scalability w/ number of fields

I know Lucene is very scalable in many ways, but how about number of fieldnames?

We have an index using around 6000 unique fieldnames,
450,000 documents, and a total index size of 4GB.   It's very
sparse... documents don't have that many fields, but the number of
different fieldtypes is huge.

An optimize of this index took about an hour (mergefactor 10, compound index)
This is on enterprise hardware (fast SCSI raid, 6GB RAM, dual 2.8GHz Xeon).
The JVM was Java5 with 2.5GB heap.

This seems very long... anyone have any insights?
We'll be running more tests to see if decreasing the number of fields
has an impact.

-Yonik

---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org

Re: scalability w/ number of fields

Posted by Yonik Seeley <ys...@gmail.com>.

Optimize performance update (with tons of indexed fields):

We had a timing bug... ignore the hour I first reported.  Here are the
current numbers:

indexed_fields=6791  index_size=3.9GB  optimize_time=21min
indexed_fields=3216  index_size=2.0GB  optimize_time=9min
indexed_fields=2080  index_size=1.4GB  optimize_time=4min

It's a little apples-to-oranges since we simply removed some of the
fields to test a lower field count (and hence the index size also goes
down).
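
One rough way to read the three runs above is as merge throughput. The sketch below only divides the reported index size by the reported optimize time, treating 1 GB as 1000 MB; the class and method names are made up for illustration, and the result carries the same apples-to-oranges caveat as the raw numbers.

```java
// Illustrative arithmetic only; the inputs are the three runs reported above.
final class OptimizeThroughput {

    /** Megabytes merged per minute, treating 1 GB as 1000 MB. */
    static double mbPerMinute(double indexSizeGb, double optimizeMinutes) {
        return indexSizeGb * 1000.0 / optimizeMinutes;
    }

    public static void main(String[] args) {
        // Columns: indexed fields, index size in GB, optimize time in minutes.
        double[][] runs = { {6791, 3.9, 21}, {3216, 2.0, 9}, {2080, 1.4, 4} };
        for (double[] r : runs) {
            System.out.printf("indexed_fields=%d  throughput=%.0f MB/min%n",
                    (int) r[0], mbPerMinute(r[1], r[2]));
        }
    }
}
```

This works out to roughly 186, 222, and 350 MB/min, so throughput falls as the field count rises, which would be consistent with a per-field cost on top of the raw data volume.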

-Yonik

On Apr 4, 2005 5:38 PM, Yonik Seeley <ys...@gmail.com> wrote:
> I know Lucene is very scalable in many ways, but how about number of fieldnames?
> 
> We have an index using around 6000 unique fieldnames,
> 450,000 documents, and a total index size of 4GB.   It's very
> sparse... documents don't have that many fields, but the number of
> different fieldtypes is huge.
> 
> An optimize of this index took about an hour (mergefactor 10, compound index)
> This is on enterprise hardware (fast SCSI raid, 6GB RAM, dual 2.8GHz Xeon).
> The JVM was Java5 with 2.5GB heap.
> 
> This seems very long... anyone have any insights?
> We'll be running more tests to see if decreasing the number of fields
> has an impact.
> 
> -Yonik
>

---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org

Re: scalability w/ number of fields

Posted by Yonik Seeley <ys...@gmail.com>.

Thanks Doug, your previous comment led us to consider compound field
types of the form compound:"name=value".  Open-ended range queries
also need some manipulation for this scheme to work.
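
A minimal sketch of that manipulation, in plain Java with no Lucene calls: a query like "name >= value" with no upper limit has to be bounded so that it stays inside the "name=" prefix of the single compound field. The class, the method names, and the use of '\uffff' as an upper sentinel are assumptions for illustration, not something specified in this thread.

```java
// Plain-Java illustration of bounding an open-ended range inside one compound field.
final class CompoundTerms {

    /** Term text stored in the single "compound" field. */
    static String term(String name, String value) {
        return name + "=" + value;
    }

    /** Lower bound for "name >= value": the encoded term itself. */
    static String lowerBound(String name, String value) {
        return term(name, value);
    }

    /** Upper bound for an open-ended range: the largest string that still starts with "name=". */
    static String openUpperBound(String name) {
        return name + "=" + '\uffff';
    }

    /** Inclusive lexicographic range check, the same ordering String.compareTo uses. */
    static boolean inRange(String candidate, String lower, String upper) {
        return candidate.compareTo(lower) >= 0 && candidate.compareTo(upper) <= 0;
    }

    public static void main(String[] args) {
        String lo = lowerBound("color", "blue");
        String hi = openUpperBound("color");
        System.out.println(inRange(term("color", "red"), lo, hi));  // same name, value above "blue"
        System.out.println(inRange(term("size", "10"), lo, hi));    // different name, must be excluded
    }
}
```

Values still compare as strings, so numeric values would need zero-padding or a similar sortable encoding before this kind of range behaves as expected.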

> Yes, this is an ugly hack, but it can make a huge performance
> difference.  The problem is that Lucene stores norm values in an array,
> when, in cases like yours, a sparse data structure might be more sensible.

Ahhh, "There's a norm file for each indexed field with a byte for each
document."
This obviously impacts segment merging... what about query performance? 
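
To put a number on "a byte for each document" per indexed field, here is a back-of-the-envelope sketch. It assumes all 450,000 documents were present in each of the test runs reported earlier and that every indexed field keeps norms; neither is confirmed in this thread, and the class and method names are hypothetical.

```java
// Back-of-the-envelope arithmetic; the document count per run is an assumption.
final class NormSize {

    /** One norm byte per document for every indexed field. */
    static long normBytes(long indexedFields, long documents) {
        return indexedFields * documents;
    }

    public static void main(String[] args) {
        long documents = 450_000L; // from the original post; assumed unchanged across runs
        long[] fieldCounts = { 6791L, 3216L, 2080L };
        for (long fields : fieldCounts) {
            System.out.printf("indexed_fields=%d  norm_bytes=%d%n",
                    fields, normBytes(fields, documents));
        }
    }
}
```

Under those assumptions the norms alone would come to 3,055,950,000 bytes for 6791 fields, 1,447,200,000 for 3216, and 936,000,000 for 2080, which is a large share of the reported index sizes.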

A question regarding stored-only fields (and having say 10,000 of
those)... I notice that stored and indexed field names are both listed
in the .fnm segment file.  Are there any performance critical places
in Lucene where this list is walked linearly?  I assume it's loaded
into an array in memory so access by fieldnum is O(1).  Also makes the
FieldNum VInts bigger, but I don't see that as having a big effect.
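
For a sense of how much bigger: a VInt carries 7 bits of the value per byte, with the high bit marking that another byte follows, so field numbers up to 127 fit in one byte and numbers up to 16,383 fit in two. The helper below is a standalone sketch of that length rule, not Lucene's own code, and its names are made up.

```java
// Standalone sketch of the VInt length rule (7 payload bits per byte).
final class VIntSize {

    /** Number of bytes a non-negative int occupies when written as a VInt. */
    static int vIntLength(int value) {
        int bytes = 1;
        while ((value & ~0x7F) != 0) { // more than 7 significant bits remain
            value >>>= 7;
            bytes++;
        }
        return bytes;
    }

    public static void main(String[] args) {
        int[] fieldNumbers = { 5, 127, 128, 9_999, 16_384 };
        for (int n : fieldNumbers) {
            System.out.println("fieldNum=" + n + " -> " + vIntLength(n) + " byte(s)");
        }
    }
}
```

With 10,000 field names, most field numbers would take two bytes instead of one.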

I'll try and keep the list informed as we get more numbers (and maybe
try out other things like generic or compound fields).

-Yonik

On Apr 6, 2005 12:28 PM, Doug Cutting <cu...@apache.org> wrote:
> Yonik Seeley wrote:
> > They are all indexed (and they all need to be under the current design).
> 
> As I mentioned before, Lucene will not perform well with a large number
> of indexed fields.  If these are not tokenized fields, then a simple way
> to reduce the number of indexed fields is to move the field name into
> the value.  Instead of adding <fieldX, valueY> and <fieldZ, valueA>, add
> <generic, fieldX-valueY> and <generic, fieldZ-valueA>.  This should
> perform quite well.  You'll also need to manipulate queries accordingly.
> 
> A similar method can work for tokenized fields.  Simply write a
> TokenFilter that prepends a field name to each token.
> 
> Yes, this is an ugly hack, but it can make a huge performance
> difference.  The problem is that Lucene stores norm values in an array,
> when, in cases like yours, a sparse data structure might be more sensible.
> 
> Doug

Re: scalability w/ number of fields

Posted by Doug Cutting <cu...@apache.org>.

Yonik Seeley wrote:
> They are all indexed (and they all need to be under the current design).

As I mentioned before, Lucene will not perform well with a large number 
of indexed fields.  If these are not tokenized fields, then a simple way 
to reduce the number of indexed fields is to move the field name into 
the value.  Instead of adding <fieldX, valueY> and <fieldZ, valueA>, add 
<generic, fieldX-valueY> and <generic, fieldZ-valueA>.  This should 
perform quite well.  You'll also need to manipulate queries accordingly.
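
A minimal sketch of this idea in plain Java, leaving out the Lucene calls themselves: each <field, value> pair becomes one value of a single field named "generic", and an exact-match query is rewritten the same way. The "-" separator and the placeholder names come from the example above; the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Plain-Java illustration; a real indexer would add these values to one indexed field.
final class GenericField {

    static final String GENERIC = "generic";

    /** Moves the field name into the value, e.g. ("fieldX", "valueY") -> "fieldX-valueY". */
    static String encode(String field, String value) {
        return field + "-" + value;
    }

    /** Turns a document's sparse <field, value> pairs into values for the one generic field. */
    static List<String> encodeDocument(Map<String, String> fields) {
        List<String> values = new ArrayList<>();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            values.add(encode(e.getKey(), e.getValue()));
        }
        return values;
    }

    /** Rewrites the query "field:value" into the matching query on the generic field. */
    static String rewriteQuery(String field, String value) {
        return GENERIC + ":" + encode(field, value);
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("fieldX", "valueY");
        doc.put("fieldZ", "valueA");
        System.out.println(encodeDocument(doc));
        System.out.println(rewriteQuery("fieldZ", "valueA"));
    }
}
```

If field names or values can themselves contain the separator, a character that never occurs in either would be a safer choice.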

A similar method can work for tokenized fields.  Simply write a 
TokenFilter that prepends a field name to each token.
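
The token transformation itself is small. The sketch below shows it on a plain list of strings rather than as a real TokenFilter subclass, so it does not use the Lucene API at all; the "-" separator and all names are assumptions for illustration.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Shows only the per-token rewrite; a real filter would apply it to each token in the stream.
final class FieldPrefixingTokens {

    /** Puts the field name in front of every token so one indexed field can hold them all. */
    static List<String> prefixTokens(String field, List<String> tokens) {
        List<String> out = new ArrayList<>(tokens.size());
        for (String token : tokens) {
            out.add(field + "-" + token);
        }
        return out;
    }

    public static void main(String[] args) {
        List<String> tokens = Arrays.asList("quick", "brown", "fox");
        System.out.println(prefixTokens("title", tokens));
    }
}
```

Query terms would need the same prefix applied before searching.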

Yes, this is an ugly hack, but it can make a huge performance 
difference.  The problem is that Lucene stores norm values in an array, 
when, in cases like yours, a sparse data structure might be more sensible.

Doug

Re: scalability w/ number of fields

Posted by Bill Au <bi...@gmail.com>.

The compound index structure is meant for indexes with a large number of fields.
I was watching the files in the index directory of my compound index while
it was being optimized.  The IndexWriter that I used was set to use the
compound file format.
It looks to me as if Lucene first combined all existing segments into a new
multifile segment, then converted this multifile segment into the
compound format.
So I think the data for the entire index is actually being written to
disk twice.
Is there any way to configure Lucene to write the data only once, directly
into a compound segment, without first writing a multifile segment?

Bill

On Apr 4, 2005 6:40 PM, Yonik Seeley <ys...@gmail.com> wrote:
> They are all indexed (and they all need to be under the current design).
> 
> -Yonik
> 
> On Apr 4, 2005 6:16 PM, Doug Cutting <cu...@apache.org> wrote:
> > Yonik Seeley wrote:
> > > I know Lucene is very scalable in many ways, but how about number of fieldnames?
> > >
> > > We have an index using around 6000 unique fieldnames,
> >
> > How many of these fields are indexed?  At this point I would recommend
> > against having more than a handful of indexed fields.  If the fields are
> > only stored, then it shouldn't make much difference.
> >
> > Doug

Re: scalability w/ number of fields

Posted by Yonik Seeley <ys...@gmail.com>.

They are all indexed (and they all need to be under the current design).

-Yonik

On Apr 4, 2005 6:16 PM, Doug Cutting <cu...@apache.org> wrote:
> Yonik Seeley wrote:
> > I know Lucene is very scalable in many ways, but how about number of fieldnames?
> >
> > We have an index using around 6000 unique fieldnames,
> 
> How many of these fields are indexed?  At this point I would recommend
> against having more than a handful of indexed fields.  If the fields are
> only stored, then it shouldn't make much difference.
> 
> Doug

Re: scalability w/ number of fields

Posted by Doug Cutting <cu...@apache.org>.

Yonik Seeley wrote:
> I know Lucene is very scalable in many ways, but how about number of fieldnames?
> 
> We have an index using around 6000 unique fieldnames,

How many of these fields are indexed?  At this point I would recommend 
against having more than a handful of indexed fields.  If the fields are 
only stored, then it shouldn't make much difference.

Doug

scalability w/ number of fields

Posted by Yonik Seeley <ys...@gmail.com>.

Oops, sorry.  First went to dev by accident.

---------- Forwarded message ----------
I know Lucene is very scalable in many ways, but how about number of fieldnames?

We have an index using around 6000 unique fieldnames,
450,000 documents, and a total index size of 4GB.   It's very
sparse... documents don't have that many fields, but the number of
different fieldtypes is huge.

An optimize of this index took about an hour (mergefactor 10, compound index)
This is on enterprise hardware (fast SCSI raid, 6GB RAM, dual 2.8GHz Xeon).
The JVM was Java5 with 2.5GB heap.

This seems very long... anyone have any insights?
We'll be running more tests to see if decreasing the number of fields
has an impact.

-Yonik
