Posted to dev@lucene.apache.org by Yonik Seeley <ys...@gmail.com> on 2005/04/04 23:35:04 UTC
scalability w/ number of fields
I know Lucene is very scalable in many ways, but how about number of fieldnames?
We have an index using around 6000 unique fieldnames,
450,000 documents, and a total index size of 4GB. It's very
sparse... documents don't have that many fields, but the number of
different fieldtypes is huge.
An optimize of this index took about an hour (mergefactor 10, compound index)
This is on enterprise hardware (fast SCSI raid, 6GB RAM, dual 2.8GHz Xeon).
The JVM was Java5 with 2.5GB heap.
This seems very long... anyone have any insights?
We'll be running more tests to see if decreasing the number of fields
has an impact.
-Yonik
---------------------------------------------------------------------
To unsubscribe, e-mail: java-dev-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-dev-help@lucene.apache.org
Re: scalability w/ number of fields
Posted by Yonik Seeley <ys...@gmail.com>.
Optimize performance update (with tons of indexed fields):
We had a timing bug... ignore the hour I first reported. Here are the
current numbers:
indexed_fields=6791 index_size=3.9GB optimize_time=21min
indexed_fields=3216 index_size=2.0GB optimize_time=9min
indexed_fields=2080 index_size=1.4GB optimize_time=4min
It's a little apples-to-oranges since we simply removed some of the
fields to test a lower field count (and hence the index size also goes
down).
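One rough way to factor out the change in index size is to divide each optimize time by the index size. The sketch below does that for the three runs above. The class and method names are made up for illustration, and it assumes the reported sizes and times are the only inputs that matter.

```java
import java.util.Locale;

// Sketch only: normalizes the reported optimize times by index size.
class OptimizeCost {
    // Hypothetical helper: minutes of optimize time per GB of index.
    static double minutesPerGb(double indexSizeGb, double optimizeMinutes) {
        return optimizeMinutes / indexSizeGb;
    }

    public static void main(String[] args) {
        // The three runs reported in the message above.
        int[] indexedFields = {6791, 3216, 2080};
        double[] indexSizeGb = {3.9, 2.0, 1.4};
        double[] optimizeMinutes = {21, 9, 4};
        for (int i = 0; i < indexedFields.length; i++) {
            System.out.printf(Locale.ROOT, "indexed_fields=%d min_per_gb=%.2f%n",
                    indexedFields[i],
                    minutesPerGb(indexSizeGb[i], optimizeMinutes[i]));
        }
    }
}
```

The per-GB cost comes out at roughly 5.4, 4.5 and 2.9 minutes for the three runs, so the optimize time falls faster than the index size as fields are removed. Three data points are not enough to say more than that.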
-Yonik
On Apr 4, 2005 5:38 PM, Yonik Seeley <ys...@gmail.com> wrote:
> I know Lucene is very scalable in many ways, but how about number of fieldnames?
>
> We have an index using around 6000 unique fieldnames,
> 450,000 documents, and a total index size of 4GB. It's very
> sparse... documents don't have that many fields, but the number of
> different fieldtypes is huge.
>
> An optimize of this index took about an hour (mergefactor 10, compound index)
> This is on enterprise hardware (fast SCSI raid, 6GB RAM, dual 2.8GHz Xeon).
> The JVM was Java5 with 2.5GB heap.
>
> This seems very long... anyone have any insights?
> We'll be running more tests to see if decreasing the number of fields
> has an impact.
>
> -Yonik
>
---------------------------------------------------------------------
To unsubscribe, e-mail: java-user-unsubscribe@lucene.apache.org
For additional commands, e-mail: java-user-help@lucene.apache.org
Re: scalability w/ number of fields
Posted by Yonik Seeley <ys...@gmail.com>.
Thanks Doug, your previous comment led us to consider compound field
types of the form compound:"name=value". Open-ended range queries
also need some manipulation for this scheme to work.
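A minimal sketch of the kind of manipulation an open-ended range might need under the "name=value" scheme. It is plain Java over a sorted set of term strings, not Lucene's query API. The class name, the helper methods and the "\uffff" sentinel are assumptions for illustration. The idea is that a missing bound cannot simply be left open, because the range would then run into the terms of other names, so it is replaced by the smallest or largest possible term for that name.

```java
import java.util.Arrays;
import java.util.NavigableSet;
import java.util.TreeSet;

// Sketch only: range bounds for terms stored as "name=value" in one field.
class CompoundRange {
    // Assumption: this character never occurs in a field value, so it sorts
    // after every real value and can stand in for a missing upper bound.
    static final String MAX_VALUE = "\uffff";

    // Build the single-field term text for one name/value pair.
    static String term(String name, String value) {
        return name + "=" + value;
    }

    // Terms that fall in [low, high] for the given name.
    // A null bound means that side of the range is open.
    static NavigableSet<String> range(NavigableSet<String> terms,
                                      String name, String low, String high) {
        String from = term(name, low == null ? "" : low);
        String to = term(name, high == null ? MAX_VALUE : high);
        return terms.subSet(from, true, to, true);
    }

    public static void main(String[] args) {
        NavigableSet<String> terms = new TreeSet<>(Arrays.asList(
                "color=red", "size=10", "size=20", "size=30", "sizes=x"));
        // Open-ended range: size >= 20, no upper bound.
        System.out.println(range(terms, "size", "20", null));
    }
}
```

Values are compared as strings here, so numeric values would need a sortable encoding such as zero padding for the ranges to behave numerically.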
> Yes, this is an ugly hack, but it can make a huge performance
> difference. The problem is that Lucene stores norm values in an array,
> when, in cases like yours, a sparse data structure might be more sensible.
Ahhh, "There's a norm file for each indexed field with a byte for each
document."
This obviously impacts segment merging... what about query performance?
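A back-of-the-envelope estimate of what one byte per document per indexed field adds up to. This is a sketch that assumes the 450,000 document count and the 6791 indexed fields describe the same index, and that ignores file headers and anything else stored alongside the norms. The class and method names are made up.

```java
// Sketch only: estimated norm storage at one byte per document per indexed field.
class NormSize {
    // Hypothetical helper: total norm bytes for a fully merged index.
    static long normBytes(long indexedFields, long documents) {
        return indexedFields * documents;
    }

    public static void main(String[] args) {
        long bytes = normBytes(6791, 450_000);
        System.out.println("estimated norm bytes: " + bytes);
    }
}
```

If those figures do belong together, the norms alone come to about 3 billion bytes, which would be most of the 3.9GB index, and all of it has to be rewritten on every optimize.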
A question regarding stored-only fields (and having say 10,000 of
those)... I notice that stored and indexed field names are both listed
in the .fnm segment file. Are there any performance critical places
in Lucene where this list is walked linearly? I assume it's loaded
into an array in memory so access by fieldnum is O(1). Also makes the
FieldNum VInts bigger, but I don't see that as having a big effect.
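To put a number on the VInt point: a VInt carries seven bits of the value per byte, with the high bit marking that more bytes follow. The sketch below counts the bytes needed for a given field number. It is a standalone illustration, not Lucene's own code, and the class and method names are made up.

```java
// Sketch only: byte length of a non-negative int in a 7-bits-per-byte encoding.
class VIntSize {
    static int vIntBytes(int value) {
        if (value < 0) {
            throw new IllegalArgumentException("value must be non-negative");
        }
        int bytes = 1;
        // Each extra byte is needed while bits above the low seven remain.
        while ((value & ~0x7F) != 0) {
            value >>>= 7;
            bytes++;
        }
        return bytes;
    }

    public static void main(String[] args) {
        System.out.println("field 100 -> " + vIntBytes(100) + " byte(s)");
        System.out.println("field 9999 -> " + vIntBytes(9999) + " byte(s)");
    }
}
```

Field numbers up to 127 fit in one byte and numbers up to 16383 fit in two, so 10,000 fields would cost at most one extra byte per field number written.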
I'll try and keep the list informed as we get more numbers (and maybe
try out other things like generic or compound fields).
-Yonik
On Apr 6, 2005 12:28 PM, Doug Cutting <cu...@apache.org> wrote:
> Yonik Seeley wrote:
> > They are all indexed (and they all need to be under the current design).
>
> As I mentioned before, Lucene will not perform well with a large number
> of indexed fields. If these are not tokenized fields, then a simple way
> to reduce the number of indexed fields is to move the field name into
> the value. Instead of adding <fieldX, valueY> and <fieldZ, valueA>, add
> <generic, fieldX-valueY> and <generic, fieldZ-valueA>. This should
> perform quite well. You'll also need to manipulate queries accordingly.
>
> A similar method can work for tokenized fields. Simply write a
> TokenFilter that prepends a field name to each token.
>
> Yes, this is an ugly hack, but it can make a huge performance
> difference. The problem is that Lucene stores norm values in an array,
> when, in cases like yours, a sparse data structure might be more sensible.
>
> Doug
Re: scalability w/ number of fields
Posted by Doug Cutting <cu...@apache.org>.
Yonik Seeley wrote:
> They are all indexed (and they all need to be under the current design).
As I mentioned before, Lucene will not perform well with a large number
of indexed fields. If these are not tokenized fields, then a simple way
to reduce the number of indexed fields is to move the field name into
the value. Instead of adding <fieldX, valueY> and <fieldZ, valueA>, add
<generic, fieldX-valueY> and <generic, fieldZ-valueA>. This should
perform quite well. You'll also need to manipulate queries accordingly.
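A minimal sketch of that rewrite for untokenized fields, in plain Java with no Lucene dependency. The class name, the "generic" constant and the "-" separator are taken from the example above or made up for illustration. The same encoding has to be applied to query terms, which is the query manipulation mentioned above.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Map;

// Sketch only: fold many field names into the values of one indexed field.
class GenericField {
    // Name of the single indexed field (illustrative).
    static final String GENERIC = "generic";

    // Encode one <field, value> pair as the value of the generic field.
    // Use the same method when building query terms.
    static String encode(String field, String value) {
        return field + "-" + value;
    }

    // Encode every pair of a document, keeping the original order.
    static List<String> encodeAll(Map<String, String> fields) {
        List<String> values = new ArrayList<>();
        for (Map.Entry<String, String> e : fields.entrySet()) {
            values.add(encode(e.getKey(), e.getValue()));
        }
        return values;
    }

    public static void main(String[] args) {
        Map<String, String> doc = new LinkedHashMap<>();
        doc.put("fieldX", "valueY");
        doc.put("fieldZ", "valueA");
        for (String value : encodeAll(doc)) {
            System.out.println("<" + GENERIC + ", " + value + ">");
        }
    }
}
```

If field names can themselves contain "-", a separator that never appears in a name would be a safer choice.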
A similar method can work for tokenized fields. Simply write a
TokenFilter that prepends a field name to each token.
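The token-level version of the same idea can be sketched without Lucene at all. The code below is not a TokenFilter subclass. It splits text on whitespace as a stand-in for a real analyzer and puts the field name at the front of each token. The class name and the "-" separator are assumptions.

```java
import java.util.ArrayList;
import java.util.List;

// Sketch only: what a field-name-prefixing filter would emit for each token.
class FieldPrefixer {
    // Split text on whitespace and prefix every token with the field name.
    static List<String> prefixTokens(String field, String text) {
        List<String> out = new ArrayList<>();
        for (String token : text.trim().split("\\s+")) {
            if (!token.isEmpty()) {
                out.add(field + "-" + token);
            }
        }
        return out;
    }

    public static void main(String[] args) {
        System.out.println(prefixTokens("title", "quick brown fox"));
    }
}
```

Query text for a given field would need to go through the same prefixing so that its terms match what was indexed.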
Yes, this is an ugly hack, but it can make a huge performance
difference. The problem is that Lucene stores norm values in an array,
when, in cases like yours, a sparse data structure might be more sensible.
Doug
Re: scalability w/ number of fields
Posted by Bill Au <bi...@gmail.com>.
The compound index structure is meant for indexes with a large number of fields.
I was watching the files in the index directory of my compound index while
it was being optimized. The IndexWriter that I used was set to use
compound file.
It looks to me that Lucene first combined all existing segments into a new
multifile segment, then it converted this multifile segment into the
compound format.
So I think the data for the entire index is actually being written to
disk twice.
Is there any way to configure Lucene to write the data only once, directly
into a compound segment, without first writing a multifile segment?
Bill
On Apr 4, 2005 6:40 PM, Yonik Seeley <ys...@gmail.com> wrote:
> They are all indexed (and they all need to be under the current design).
>
> -Yonik
>
> On Apr 4, 2005 6:16 PM, Doug Cutting <cu...@apache.org> wrote:
> > Yonik Seeley wrote:
> > > I know Lucene is very scalable in many ways, but how about number of fieldnames?
> > >
> > > We have an index using around 6000 unique fieldnames,
> >
> > How many of these fields are indexed? At this point I would recommend
> > against having more than a handful of indexed fields. If the fields are
> > only stored, then it shouldn't make much difference.
> >
> > Doug
Re: scalability w/ number of fields
Posted by Yonik Seeley <ys...@gmail.com>.
They are all indexed (and they all need to be under the current design).
-Yonik
On Apr 4, 2005 6:16 PM, Doug Cutting <cu...@apache.org> wrote:
> Yonik Seeley wrote:
> > I know Lucene is very scalable in many ways, but how about number of fieldnames?
> >
> > We have an index using around 6000 unique fieldnames,
>
> How many of these fields are indexed? At this point I would recommend
> against having more than a handful of indexed fields. If the fields are
> only stored, then it shouldn't make much difference.
>
> Doug
Re: scalability w/ number of fields
Posted by Doug Cutting <cu...@apache.org>.
Yonik Seeley wrote:
> I know Lucene is very scalable in many ways, but how about number of fieldnames?
>
> We have an index using around 6000 unique fieldnames,
How many of these fields are indexed? At this point I would recommend
against having more than a handful of indexed fields. If the fields are
only stored, then it shouldn't make much difference.
Doug
scalability w/ number of fields
Posted by Yonik Seeley <ys...@gmail.com>.
Oops, sorry. First went to dev by accident.
---------- Forwarded message ----------
I know Lucene is very scalable in many ways, but how about number of fieldnames?
We have an index using around 6000 unique fieldnames,
450,000 documents, and a total index size of 4GB. It's very
sparse... documents don't have that many fields, but the number of
different fieldtypes is huge.
An optimize of this index took about an hour (mergefactor 10, compound index)
This is on enterprise hardware (fast SCSI raid, 6GB RAM, dual 2.8GHz Xeon).
The JVM was Java5 with 2.5GB heap.
This seems very long... anyone have any insights?
We'll be running more tests to see if decreasing the number of fields
has an impact.
-Yonik